Abstract 

Sensitivity analysis for stochastic systems is typically carried out via derivative estimation, 
which critically requires parametric model assumptions. In many situations, however, we want 
to evaluate the effect of model misspecification beyond a given parametric family of models, or in 
some cases, there are plainly no parametric models to begin with. Motivated by this deficiency, we 
propose a sensitivity analysis framework that is parameter-free, by using the Kullback-Leibler 
divergence as a measure of model discrepancy, and obtain well-defined derivative estimators. 
These estimators are robust in that they automatically choose the worst (and best)-case 
directions to move along in the (typically infinite-dimensional) model space. They require little 
knowledge to implement; the distributions of the underlying random variables can be known up 
to, for example, black-box simulation. Methodologically, we identify these worst-case directions 
of the model as changes of measure that are the fixed points of a class of functional contraction 
maps. These fixed points can be asymptotically expanded, resulting in derivative estimators 
that are expressed in closed-form formulas in terms of the moments of certain "symmetrizations" 
of the system, and hence are readily computable. 



1 Introduction 

The evaluation of performance measures for stochastic systems, be it using simulation or analytical 
techniques, often relies heavily on the underlying model assumptions. Typically, the computational 
goal is the expectation of a certain functional $h(\cdot)$ that depends on the underlying random 
components, say, $\mathbf{X} = (X_1, X_2, \ldots)$, i.e. $E[h(\mathbf{X})]$. The model assumptions on these random components 
intricately affect the quantity to be estimated. For example, in queueing networks, the expected 
queue length and workload depend on the distributional assumptions imposed on the sequence of 
interarrival and service times. In finance, the computation of option prices requires the transition 
probability distributions of the asset price movement and risk factors. Computations that are based 
on wrong assumptions on the underlying components can affect the output in various magnitudes. 

To address this issue, a vast literature has been written to assess the effect of model misspecifi- 
cation. One major assessment tool is sensitivity analysis, or more specifically derivative estimation, 
by calculating the effect on the performance measure due to an infinitesimal change of the specified 
parameter in the model assumption. The computation of such derivative estimators can be done 
via several techniques. The most straightforward method is to use finite difference and invoke the 
first principle of differentiation ^151 [53] . When the dependency of the parameter is in the functional 
h, under some regularity conditions, differentiation and expectation operators can be interchanged 
to obtain derivative estimators directly in expectation form; this is known as pathwise 
differentiation or infinitesimal perturbation analysis (see, for instance, [28] [TBI [IP]). On the other hand, 
when the parameter is involved in the density function, the likelihood ratio method (or the score 
function method) can be employed via differentiation of the density, resulting in an estimator in 
the form of an expectation that is weighted by the so-called score function (see [23^ I44j). Other 
techniques that seek to improve and generalize the above standard methods include conditioning 
and smoothed perturbation analysis |2l], kernel estimator [38], the push-out method [l6l ST], 
and other modifications or combinations of the above [32l [52l [HH HSl [29]. The surveys [33], [16], 
§VII in [3] and §7 in [20] provide general overviews of these different methods. 

All these derivative estimators, both conceptually and practically, require to a certain extent 
a parametric model to be in place. Of course, when the model is known to belong to a particular 
family, they can be effective in the assessment. However, more often than not, we want to assess 
the effect on our estimate due to model discrepancy that potentially goes beyond a change in a 
specified parameter. For example, a communication network administrator wants to forecast the 
chance of capacity overflow, but may not know exactly whether a Markovian assumption holds for the arrival 
and service time distributions. A portfolio manager who wants to predict future returns may have 
doubts about the lognormal assumptions made on the individual asset price movements. Assessing 
the effect of model deviations of this sort clearly goes beyond the classical parametric derivative 
framework. More importantly, in many instances in operations management, there is no natural 
guidance in deciding a "true" parametric model; rather, nonparametric methods based on past data 
are adopted. In these situations there is not even a parametric family to begin with, not to mention 
the issue of assessing the effect of model uncertainty via classical sensitivity analysis. 

With such motivation, we develop a notion of nonparametric derivatives that can assess, in 
a well-defined sense, how the performance measure changes under an infinitesimal change in the 
general model space. These nonparametric derivatives possess the following features: 

1. Parameter-free/data-driven. Minimal information about the underlying model is 
required. For example, the distribution of the underlying random components can be known 
up to black-box simulation. They can also be known in the form of smoothed empirical dis- 
tributions based on collected data, or other primitive forms. In particular, no parameter is 
needed for any of our analysis. 

2. Robust/adversary. These estimators capture model discrepancies along the worst and 
best-case directions in the space of admissible models (which is typically infinite-dimensional). As 
a result, they dominate all classical parametric derivatives (in a sense we will discuss). 

3. Computationally efficient. To compute these estimators, no additional knowledge is re- 
quired other than the numerical generation of the random components and the functional 
of interest. In fact, the form of these estimators depends only on the moments of certain 
"symmetrizations" of the stochastic system, via a simple sequential conditioning scheme, and 
hence they are readily computable. 

Let us briefly explain our methodology for developing these nonparametric derivatives; the 
main idea and formulation will be presented in more detail in Section 2. First, to properly define 
derivatives one needs a notion of distance. Under parametric assumptions, Euclidean distance is 
naturally employed as a measurement of distance. In the nonparametric situation, one can use the 
notion of statistical distance [SJ defined between two probability distributions, such as 
Kullback-Leibler (KL) divergence [31], to measure the closeness of models. However, without parametric 
assumptions, the space of admissible models is generally infinite-dimensional; consequently, blindly 
shrinking the statistical distance to zero and applying the first principle of differentiation can lead 
to different rate-of-change values that result from different directions of movement. The key, then, 
is to use a robust approach and find the natural steepest descent direction among all admissible 
directions. Doing so requires solving appropriately posed optimization problems, which give the 
best and worst-case directions of movement of model discrepancy, before shrinking the distance to 
zero. 

These optimization problems, which we will describe in Section 2, have strong connections to 
the robust control and optimization literature. Roughly speaking, the goal of this literature is to 
design the notion of optimum and to find optimal decisions in contexts where full probabilistic 
characterization is unavailable or intractable. One common approach is to consider a max-min 
objective, where the minimum is taken over a class of models that is believed to contain the truth, 
often called the uncertainty set [35]. The choice of the class defining the uncertainty set can vary, 
depending on both belief and computational consideration. In particular, a class that has been 
extensively studied is deterministic robust optimization, where the model is taken as degenerate 
and the uncertainty set is a deterministic region (see for instance [SI [71 El Sj). When the model 
in consideration is not deterministic, the uncertainty set is usually taken as a region defined via 
statistical distance such as KL divergence. This approach has also been studied widely, particularly 
for control problems in engineering [33], economics and finance [26l |271 |2Tl [37], queueing [30], and 
dynamic pricing [36]. As a tool for model sensitivity assessment, our approach naturally takes the 
latter type, i.e. the discrepancy in model is probabilistic. Rather than searching for an optimal 
control policy, however, our goal is to assess the effect of model discrepancy. In this regard, [22] has 
considered a similar formulation and coined the technique robust Monte Carlo as the solution to a 
class of convex programs that arise in finance (the reason that Monte Carlo appears is because of 
a naturally defined importance sampling scheme for the dual problem; see (26) below). This solu- 
tion, as we will discuss, is a consequence of single-stage robust control and serves as an important 
building block for our development. 

Despite such a rich literature on robust optimization, the derivation of our results calls for 
new methodological development, for the following reason: the behavior of derivative estimators 
relies crucially on the local, not global, properties of the associated optimization problems. In fact, 
with the assumptions on h that we shall discuss, our posed optimization problems are in general 
non-convex and globally intractable. This issue is especially prominent in the case of multi-stage 
problems, when a sequence of random variables, all generated according to the same model, shows 
up in the evaluation of the performance measure; such a scenario appears often in real applications 
(think of, for example, the sequence of interarrival times and service times in a queue generated in 
an i.i.d. fashion). Therefore, instead of directly solving these optimization problems, we leverage 
the stochastic interpretations on the underlying system to carry out local analysis. In particular, 
to characterize the steepest descent directions, we develop a new representation of the associated 
locally optimal changes of measure as the fixed points to a class of functional contraction maps. 
These fixed points in turn possess tractable asymptotic behavior as the involved statistical distances 
vanish, and support high order Taylor series type expansions. We shall highlight that, in addition 
to obtaining asymptotic expansions, the contraction characterization of the optimal changes of 
measure is of interest itself and gives a precise description on the optimal movement under the 
involved structural landscape. The study of these contraction operators and their fixed points 
will be central in our line of mathematical development and are the key to the intriguing 
"symmetrization" form of our derivative estimators. 

Lastly, our development in this paper relates to two lines of study in the statistics literature, 
one on the form of our estimators and the other on our formulation; we believe that a deeper 
understanding of both connections, with the goal of integration with real data, is important and 
shall comprise future research. Here we briefly discuss the relations and the differences between our 
development and these lines of study. The first one is the asymptotic distribution theory of von 
Mises differentiable statistical functions (§6 in [48]). This theory studies the rate of convergence and 
the normality properties of statistics that can be expressed as functionals of empirical distributions. 
There, the main development involves expansions using Fréchet derivatives that reduce the problem 
to analyzing some U-statistics of the individual samples, whose asymptotic moments turn out to 
bear a "symmetrization" form similar to this paper. In the related literature of robust statistics, 
such symmetrization form can be interpreted as the so-called "influence curve" that measures the 
effect on a statistic due to contamination at each arbitrary point in the space [25]. Methodologically, 
however, the derivation in von Mises' theory crucially utilizes Lévy's continuity theorem and spectral 
analysis, in contrast to the optimization and change-of-measure approach proposed in this paper. 
Such methodological difference arises as von Mises' theory concentrates on finding central limit 
asymptotics for statistics of interest as data size grows, whereas our focus is on quantifying the 
effect of model discrepancy using a robust formulation without a priori data structure. We shall 
leave the full exploration of this connection to future research, which we believe can be fruitful in 
gaining further understanding on data-related robustness properties of some important performance 
measures in the operations research literature. 

The second related line of study is the empirical likelihood (EL) method [l2]. This method revisits 
classical inference problems, such as hypothesis testing, using empirical distributions instead 
of parametric models. Because the empirical distribution is the maximum likelihood estimator in 
the space of distributions, an analog of Wilks' Theorem (p2| P- 92) establishes that the associated 
logarithmic likelihood ratio between the empirical distribution and the true hypothesis (known as 
the profile likelihood) is asymptotically chi-square. Consequently, the computation of confidence 
region for a certain model parameter (such as moment) can be formulated as a constrained opti- 
mization problem with the associated critical values. The constraint of this optimization problem, 
which involves the so-called empirical likelihood, is analogous to the uncertainty set in the robust 
optimization formulation, and finding the confidence region is analogous to finding the range of 
performance outputs between the best and worst-case directions of model discrepancy in our set- 
ting. Although the EL method involves quite different asymptotic calculations due to its focus on 
inference, it can potentially be combined with our nonparametric derivative machinery to construct 
asymptotically valid confidence intervals for simulation estimators that are subject to input model 
uncertainty; we shall provide some discussion of this connection in Section 8. 

This paper is organized as follows. First, we explain our main idea and formulation in Section 
2. We then state our main results in Section 3. Methodological development will be presented in 
Sections 4 and 5, followed by a discussion on the connection to parametric models in Section 6. 
Section 7 will then demonstrate our methods in some numerical experiments, and Section 8 will be 
devoted to a discussion of future work. We defer some technical proofs to the appendix. 

2 Main Idea and Formulation 

To explain our framework, let us call a real-valued function $h(\cdot)$ the cost function for convenience, 
and we are interested in the performance measure $E[h(\mathbf{X})]$, where $\mathbf{X} = (X_1, X_2, \ldots)$ is a (potentially 
infinite) sequence of i.i.d. random components each lying on the domain $\mathcal{X}$. The cost function $h$ 
and the domain $\mathcal{X}$ can be quite general. For example, $h(\mathbf{X})$ can be the waiting time of the 100-th 
customer in a queueing system, where $\mathbf{X}$ is the sequence of interarrival and service times. 

Our nonparametric sensitivity analysis starts from a benchmark model that is believed to 
describe the random components $X_1, X_2, \ldots$, and is used in the primary computation. Denote $P_0$ 
and $E_0$ as the probability distribution and the expectation under this benchmark model. We call 
$E_0[h(\mathbf{X})]$ the benchmark performance measure. The goal is to assess the effect on $E_0[h(\mathbf{X})]$ when 
there is an infinitesimal change in the model, i.e. the distribution $P_0$. 

As discussed in the introduction, to properly define derivatives in nonparametric settings, we 
use the notion of statistical distance defined on probability distributions as a measurement of model 
discrepancy. Throughout this paper we shall use KL divergence, defined between two distributions 
$P_1$ and $P_2$, as 

$$D(P_1\|P_2) := \int \log\frac{dP_1}{dP_2}\,dP_1 = E_2\left[\frac{dP_1}{dP_2}\log\frac{dP_1}{dP_2}\right]$$

where $dP_1/dP_2$ is the likelihood ratio, equal to the Radon-Nikodym derivative of $P_1$ with respect 
to $P_2$ (assuming absolute continuity), and $E_2[\cdot]$ is the expectation under $P_2$. The choice of KL 
divergence has several analytical advantages over other statistical distances and streamlines our 
development, as will be seen in the next section. Although KL divergence is, strictly speaking, not 
a metric, it serves as a measurement of distance in that two distributions possess similar statistical 
properties when they have a small KL divergence [51]. Supposing there is a slightly perturbed 
version of the benchmark model $P_0$, say $P_f$, so that $D(P_f\|P_0) \approx 0$, indeed one would expect that 
$E_f[h(\mathbf{X})] \approx E_0[h(\mathbf{X})]$, which provides a basis for a rate-of-change analysis. 
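
As a small illustration of the definition, the two expressions for the KL divergence can be compared on a finite sample space, where the Radon-Nikodym derivative is simply a ratio of probability masses. This is only a sketch: the function names and the two example distributions are assumptions made for illustration and do not come from the paper.

```python
import math

def kl_integral_form(p1, p2):
    """D(P1||P2) written as the integral of log(dP1/dP2) with respect to P1."""
    return sum(a * math.log(a / b) for a, b in zip(p1, p2) if a > 0)

def kl_expectation_form(p1, p2):
    """D(P1||P2) written as E_2[L log L], with likelihood ratio L = dP1/dP2."""
    total = 0.0
    for a, b in zip(p1, p2):
        ratio = a / b
        if ratio > 0:
            total += b * ratio * math.log(ratio)
    return total

# Hypothetical benchmark P_0 and perturbed model P_f on three points.
p0 = [0.5, 0.3, 0.2]
pf = [0.4, 0.4, 0.2]

print(kl_integral_form(pf, p0), kl_expectation_form(pf, p0))
# KL divergence is not symmetric, which is why it is not a metric.
print(kl_integral_form(p0, pf))
```

Both forms require $P_1$ to be absolutely continuous with respect to $P_2$; in the finite case this means every point with positive mass under `p1` also has positive mass under `p2`.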

The main issue with plainly using the first principle of differentiation, when shrinking $D(P_f\|P_0)$ 
to 0, is the dimensionality: the space is now the set of probability distributions that are absolutely 
continuous with respect to $P_0$, which is typically infinite-dimensional. As a result, moving along 
different directions in this space can lead to different rates of change in the performance measure. To 
tackle this issue, as discussed in the introduction, we shall adopt a robust approach by considering 
the most extreme directions in taking the derivative, namely the ones that give the highest and 
the lowest rates of change in the performance measure. More concretely, before shrinking the KL 
divergence $D(P_f\|P_0)$ to 0, consider the optimization problems 

$$\begin{array}{ll} \max & E_f[h(\mathbf{X})] \\ \text{subject to} & D(P_f\|P_0) \le \eta \\ & P_f \in \mathcal{P}_0 \end{array} \tag{1}$$

and 

$$\begin{array}{ll} \min & E_f[h(\mathbf{X})] \\ \text{subject to} & D(P_f\|P_0) \le \eta \\ & P_f \in \mathcal{P}_0 \end{array} \tag{2}$$

where $\mathcal{P}_0$ is the set of all probability distributions that are absolutely continuous with respect to 
$P_0$, which comprises our admissible model space. The problems (1) and (2) calculate the maximal 
and minimal values of the performance measure when the perturbed distribution $P_f$ is within an 
$\eta$-neighborhood of $P_0$. 

The key message of this paper is that when letting $\eta$ go to 0, under mild assumptions on $h(\mathbf{X})$, 
the optimal values of (1) and (2) can each be expressed as 

$$E_{f^*}[h(\mathbf{X})] = E_0[h(\mathbf{X})] + \zeta_1(P_0, h, \mathbf{X})\,\eta^{1/2} + \zeta_2(P_0, h, \mathbf{X})\,\eta + \cdots \tag{3}$$

where $\zeta_1, \zeta_2, \ldots$ is a sequence of coefficients that can be written in closed-form formulas in terms of $h$ 
and $\mathbf{X}$ under the benchmark distribution $P_0$. The coefficient $\zeta_1$ can be interpreted as the first order 
nonparametric derivative, $2\zeta_2$ as the second order, and so forth, as a mimic of Taylor series. The 
rest of this paper focuses on finding these coefficients, presented in a roughly increasing order of 
sophistication regarding system complexity. Also, let us concentrate on the maximization problem 
(1), since the minimization counterpart (2) is merely a replacement of $h$ by $-h$; we shall discuss 
this implication in Section 6.1. 



3 Main Results 

We state our results on three levels, namely when the cost function h depends on a single variable, 
a sequence of i.i.d. variables on a fixed time horizon, and a sequence of i.i.d. variables on a random 
time horizon. 

3.1 Preliminary Results and Insights 

Suppose that the cost function $h$ depends only on a single variable $X \in \mathcal{X}$ in the formulation (1), 
i.e. the maximization problem (1) is 

$$\begin{array}{ll} \max & E_f[h(X)] \\ \text{subject to} & D(P_f\|P_0) \le \eta \\ & P_f \in \mathcal{P}_0. \end{array} \tag{4}$$

We impose the following assumptions: 

Assumption 1. The variable $h(X)$ has exponential moment in a neighborhood of 0 under $P_0$, i.e. 
$E_0[e^{\theta h(X)}] < \infty$ for $\theta \in (-r, r)$ for some $r > 0$. 

Assumption 2. The variable $h(X)$ is non-degenerate, i.e. non-constant, under the benchmark 
distribution $P_0$. 

The first assumption is on the light-tailedness of $h(X)$. This is in particular satisfied by any 
bounded $h(X)$, which handles all probability estimation problems. The second assumption is to 
exclude the uninteresting degeneracy case. More importantly, it also ensures that the benchmark 
distribution $P_0$ is not an optimal model, since non-degeneracy always leads to an opportunity of 
upgrading the performance output by rebalancing the measure. This in turn guarantees the validity 
of an ascent direction. In this single-variable scenario, we can get a very precise understanding of 
the effect of model misspecification when $\eta$ is small: 

Theorem 1. Suppose Assumptions 1 and 2 hold. Denote $\psi(\beta) = \log E_0[e^{\beta h(X)}]$ as the logarithmic 
moment generating function of $h(X)$. When $\eta > 0$ is within a sufficiently small neighborhood of 0, 
the optimal value of (4) is given by 

$$E_{f^*}[h(X)] = \psi'(\beta^*) \tag{5}$$

where $\beta^*$ is the unique positive solution to the equation $\beta\psi'(\beta) - \psi(\beta) = \eta$. In particular, we have 

$$E_{f^*}[h(X)] = E_0[h(X)] + \sqrt{2\mathrm{Var}_0(h(X))}\,\eta^{1/2} + \frac{1}{3}\frac{\kappa_3(h(X))}{\mathrm{Var}_0(h(X))}\,\eta + O(\eta^{3/2}) \tag{6}$$

where $\mathrm{Var}_0(h(X))$ is the variance of $h(X)$, and $\kappa_3(h(X))$ is the third order cumulant of $h(X)$ under 
$P_0$. 
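
To make Theorem 1 concrete, the following sketch solves $\beta\psi'(\beta) - \psi(\beta) = \eta$ numerically and compares $\psi'(\beta^*)$ with the two-term expansion (6). It assumes, purely for illustration, that $h(X)$ is an indicator with $P_0(h(X) = 1) = p$ (a probability estimation problem), so that $\psi$, the variance and the third cumulant are available in closed form; the function names, the bisection solver and the numerical values are choices made here, not part of the paper.

```python
import math

def psi(beta, p):
    """Log-MGF of h(X) ~ Bernoulli(p) under the benchmark P_0."""
    return math.log(1.0 - p + p * math.exp(beta))

def psi_prime(beta, p):
    """psi'(beta): the mean of h(X) under the exponentially tilted model."""
    tilted = p * math.exp(beta)
    return tilted / (1.0 - p + tilted)

def worst_case_mean(eta, p, beta_max=50.0, tol=1e-13):
    """Solve beta*psi'(beta) - psi(beta) = eta for beta > 0 by bisection; return psi'(beta*).

    The left-hand side is increasing in beta > 0 (its derivative is beta*psi''(beta)),
    and a root exists only if eta < -log(p), which holds for small eta.
    """
    lo, hi = 0.0, beta_max
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mid * psi_prime(mid, p) - psi(mid, p) < eta:
            lo = mid
        else:
            hi = mid
    return psi_prime(0.5 * (lo + hi), p)

def two_term_expansion(eta, p):
    """Right-hand side of (6) without the remainder, for Bernoulli(p) h(X)."""
    var = p * (1.0 - p)
    kappa3 = p * (1.0 - p) * (1.0 - 2.0 * p)
    return p + math.sqrt(2.0 * var * eta) + kappa3 / (3.0 * var) * eta

for eta in (1e-2, 1e-3, 1e-4):
    print(eta, worst_case_mean(eta, 0.1), two_term_expansion(eta, 0.1))
```

The gap between the two printed columns should shrink at the rate $\eta^{3/2}$ as $\eta$ decreases, in line with the remainder term in (6).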
Below are some observations concerning Theorem 1 that can be shown in significant generality: 

1. First order derivative interpretation. First, from (6) it appears that the square root of KL 
divergence is the correct scaling of the first order model misspecification effect. The first order 
derivative of $E_{f^*}[h(X)]$ with respect to $\eta^{1/2}$ is 

$$\lim_{\eta \to 0} \frac{E_{f^*}[h(X)] - E_0[h(X)]}{\eta^{1/2}} = \sqrt{2\mathrm{Var}_0(h(X))}$$

which can be interpreted as the maximal first order nonparametric derivative with respect to 
KL divergence. 

2. Second order derivative interpretation. Similar to the first order derivative, the quantity 
$(2/3)(\kappa_3(h(X))/\mathrm{Var}_0(h(X)))$ can be interpreted as the second order nonparametric derivative. 
It can be written more explicitly in terms of higher order moments as 

$$\frac{1}{\mathrm{Var}_0(h(X))}\left(\frac{2}{3}E_0[h(X)^3] - 2E_0[h(X)]E_0[h(X)^2] + \frac{4}{3}(E_0[h(X)])^3\right).$$



3. Form of the derivatives. It turns out that the form $\sqrt{2\mathrm{Var}_0(h(X))}$ is a general pattern: 
even in more complex systems, the first order derivative is always of the form $\sqrt{2\mathrm{Var}_0(g(X))}$ 
where $h$ is replaced by a function $g$ that captures some kind of "symmetrization" of $h$. This 
symmetrization typically involves a sum of conditional expectations. In the case of random 
time horizon problems, the sum can be infinite (yet still computable). The second order 
derivative can involve an extra term that captures a second order "symmetrization" of $h$. 
These generalizations will be discussed in Sections 3.2 and 3.3. 



4. Translation invariance. Both the first and the second order nonparametric derivatives are 
translation invariant in $h(X)$, i.e. they retain the same values when $h(X)$ is replaced by 
$h(X) + c$ for some constant $c$. This is apparent from the formula in (6) since cumulants of 
order two and above are invariant to constant shifts. It is also consistent with the intuitive 
notion that adding a constant to $h(X)$ should not change its sensitivity analysis. This 
observation has implications for the development of the corresponding quantities in random 
time horizon problems in Section 3.3. 
5. Sign of the derivative. The first order nonparametric derivative $\sqrt{2\mathrm{Var}_0(h(X))}$ is always 
positive. This is because of the maximization formulation. The derivative is taken along the 
direction, among all the eligible changes of measure, that gives the largest increase in the 
expectation of the cost function. In general, when the formulation is a minimization, the sign 
of the first order derivative is negative while keeping the same magnitude, and the second 
order derivative is the same as the maximization counterpart. Under Assumptions 1 and 
2, the only change in (5) for the minimization formulation is that $\beta^*$ should be the unique 
negative solution of the same equation. 

6. Parametric dominance. Because of the optimization formulations (1) and (2), the magnitudes 
of our nonparametric derivatives always dominate any eligible classical parametric derivative, 
under a rescaling from Euclidean distance to KL divergence. This will be discussed in more 
detail in Section 6.2. Example 1 below also illustrates this point. 



Before we move on, let us consider a toy example for illustration: 
Example 1 (Gaussian mean). Consider the quantity $EX$, where $X$ is a mean-zero Gaussian 
random variable with variance $\sigma^2$. Then obviously $E_0X = 0$. We will compare a classical parametric 
derivative with our nonparametric approach. First, suppose the mean of $X$ is misspecified, but it is 
still Gaussian with variance $\sigma^2$, so that the mean is $\mu \neq 0$ instead of 0, i.e. $E_fX = \mu$. This model 
misspecification effect can be summarized by establishing that the first order parametric derivative 
of $EX$ with respect to $\mu$ is 1. 

To quantify this movement of $\mu$ in terms of KL divergence, let us consider $P_f$ as $N(\mu, \sigma^2)$ in 
the Gaussian family. It is an easy exercise to derive the relation $D(P_f\|P_0) = \mu^2/(2\sigma^2)$: 

$$D(P_f\|P_0) = E_0[L\log L] = E_0\left[\exp\left\{\frac{\mu Z}{\sigma^2} - \frac{\mu^2}{2\sigma^2}\right\}\left(\frac{\mu Z}{\sigma^2} - \frac{\mu^2}{2\sigma^2}\right)\right] = \frac{\mu^2}{2\sigma^2}$$

where $L = \exp\{\mu Z/\sigma^2 - \mu^2/(2\sigma^2)\}$ is the likelihood ratio between $P_f$ and $P_0$, and $Z$ denotes a $N(0, \sigma^2)$ 
random variable. Hence a movement of $\mu$ units in the mean triggers a movement of $\mu^2/(2\sigma^2)$ units in 
KL divergence. In terms of the derivative with respect to the square root of KL divergence, the first 
order sensitivity of $E_fX$ is then $1 \times \sqrt{2}\sigma = \sqrt{2}\sigma$ along the direction of movement in mean within 
the Gaussian family with variance $\sigma^2$. 

Going to our formulation, Theorem 1 states that the maximal increase in the value of $EX$, over 
all admissible distributional movement, is 

$$E_{f^*}X - E_0X = \sqrt{2\mathrm{Var}_0(X)}\,\eta^{1/2} + O(\eta) = \sqrt{2}\sigma\,\eta^{1/2} + O(\eta).$$

In particular, according to discussion point 6 above (which will be justified rigorously in Section 6.2), 
this should provide an upper bound for the sensitivity when confined to the Gaussian model. Indeed, 
the first order nonparametric derivative is $\sqrt{2\mathrm{Var}_0(X)} = \sqrt{2}\sigma$, which matches exactly with the 
parametric derivative along the movement in mean within the Gaussian family. 
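
The Gaussian calculation can be checked numerically. The sketch below approximates $E_0[L\log L]$ for a mean shift by trapezoidal quadrature and then rescales the parametric sensitivity $\mu$ by the square root of the resulting KL divergence. The function names, the quadrature settings and the values $\mu = 0.3$, $\sigma = 2$ are assumptions chosen for illustration only.

```python
import math

def normal_pdf(z, mean, sigma):
    """Density of N(mean, sigma^2) at z."""
    return math.exp(-0.5 * ((z - mean) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def kl_mean_shift(mu, sigma, half_width=12.0, n=20000):
    """Trapezoidal approximation of E_0[L log L] for P_f = N(mu, sigma^2), P_0 = N(0, sigma^2)."""
    a, b = -half_width * sigma, half_width * sigma
    step = (b - a) / n
    total = 0.0
    for i in range(n + 1):
        z = a + i * step
        # log-likelihood ratio log L evaluated at Z = z
        log_l = mu * z / sigma**2 - mu**2 / (2.0 * sigma**2)
        weight = 0.5 if i in (0, n) else 1.0
        total += weight * math.exp(log_l) * log_l * normal_pdf(z, 0.0, sigma)
    return total * step

def sensitivity_along_mean_shift(mu, sigma):
    """Change in E_f[X] (equal to mu) per unit of the square root of KL divergence."""
    return mu / math.sqrt(kl_mean_shift(mu, sigma))

mu, sigma = 0.3, 2.0
print(kl_mean_shift(mu, sigma), mu**2 / (2.0 * sigma**2))
print(sensitivity_along_mean_shift(mu, sigma), math.sqrt(2.0) * sigma)
```

In this Gaussian mean-shift family the ratio equals $\sqrt{2}\sigma$ for every $\mu$, not only in the limit, because the KL divergence is exactly quadratic in $\mu$.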

In the coming subsections we shall generalize the above observations to more complex systems 
where sequences of random components are involved. 



3.2 Finite Horizon Problems and Symmetrization Estimators 

In this section, we consider cost functions that depend on not only one but a sequence of random 
variables. We focus on the case when the sequence is i.i.d. on a fixed time horizon. Namely, 
we consider $E[h(\mathbf{X}_T)]$, where $\mathbf{X}_T = (X_1, \ldots, X_T) \in \mathcal{X}^T$, each $X_t$ being i.i.d., for the fixed time 
horizon $T \ge 1$. For convenience we also use $X$ to denote a generic random variable with the 
same distribution as all the $X_t$'s. Our focus is to assess the model misspecification effect when 
the identical probability distribution of the $X_t$'s deviates from $P_0$. The formulation in (1) is now 
written as 

$$\begin{array}{ll} \max & E_f[h(\mathbf{X}_T)] \\ \text{subject to} & D(P_f\|P_0) \le \eta \\ & P_f \in \mathcal{P}_0. \end{array} \tag{7}$$

The function $h(\cdot, \ldots, \cdot)$ is now assumed to satisfy a majorization condition: 

Assumption 3. The cost function $h$ satisfies $|h(\mathbf{X}_T)| \le \sum_{t=1}^T \Lambda_t(X_t)$ a.s., for some deterministic 
functions $\Lambda_t(\cdot)$, where each of the $\Lambda_t(X_t)$'s possesses exponential moment, i.e. $E_0[e^{\theta\Lambda_t(X)}] < \infty$ for 
$\theta$ in a neighborhood of zero. 

Assumption 3 is easy to check and holds in many contexts. In particular, it holds trivially when 
$h$ is bounded. Some examples will be demonstrated in Section 7. 

Next, we also assume a non-degeneracy condition that is analogous to Assumption 2; this 
requires introducing a function $g(\cdot) := \mathcal{G}(h)(\cdot)$ where $\mathcal{G}$ is a functional acted on $h$ and $g := \mathcal{G}(h)$ 
maps from $\mathcal{X}$ to $\mathbb{R}$. This function $g(x)$ is defined as the sum of individual conditional expectations 
of $h(\mathbf{X}_T)$ over all time steps, i.e. 

$$g(x) = \sum_{t=1}^T g_t(x) \tag{8}$$

where $g_t(x)$ is the individual conditional expectation at time $t$, given by 

$$g_t(x) = E_0[h(\mathbf{X}_T) \mid X_t = x]. \tag{9}$$

Our non-degeneracy condition is imposed on $g$ acted on the individual random variable $X$: 

Assumption 4. The random variable $g(X)$ is non-degenerate, i.e. non-constant, under the 
benchmark distribution $P_0$. 

Assumption 4 guarantees that $P_0$ is not a "locally optimal" model, in the sense that there always 
exists a direction of model discrepancy that strictly dominates $P_0$. When $T = 1$, Assumption 4 
reduces to Assumption 2. 

We have the following result: 

Theorem 2. Suppose Assumptions 3 and 4 hold. Then the optimal value of (7) satisfies 

$$E_{f^*}[h(\mathbf{X}_T)] = E_0[h(\mathbf{X}_T)] + \sqrt{2\mathrm{Var}_0(g(X))}\,\eta^{1/2} + \frac{1}{\mathrm{Var}_0(g(X))}\left(\frac{\kappa_3(g(X))}{3} + \nu\right)\eta + O(\eta^{3/2}) \tag{10}$$

where $\mathrm{Var}_0(g(X))$ and $\kappa_3(g(X))$ are the variance and the third order cumulant of $g(X)$ respectively, 
and 

$$\nu = E_0[(G(X,Y) - E_0[G(X,Y)])(g(X) - E_0[g(X)])(g(Y) - E_0[g(Y)])]. \tag{11}$$

Here $g(\cdot)$ is defined in (8) and $G(\cdot, \cdot)$ is a function derived from $h$ that is defined as 

$$G(x,y) = \sum_{t=1}^T \sum_{s=1,\ldots,T,\, s \neq t} G_{ts}(x,y) \tag{12}$$

where 

$$G_{ts}(x,y) = E_0[h(\mathbf{X}_T) \mid X_t = x, X_s = y]. \tag{13}$$

Here $X$ and $Y$ are independent random variables each having distribution $P_0$. 



The first and second order coefficients in (10) possess all the properties of nonparametric 
derivatives discussed in Section 3.1. Additionally, we note the following: 

1. The function $g(\cdot)$ can be interpreted as a "symmetrization" of the multivariate function 
$h(\cdot, \ldots, \cdot)$, by collapsing the domain from $\mathcal{X}^T$ to $\mathcal{X}$ via sequential conditioning. This function 
$g(\cdot)$ plays an important role in controlling the associated optimality conditions and comes up 
naturally due to the i.i.d. assumptions on $X_1, \ldots, X_T$, as will be seen in our development in 
Section 5. Besides, an alternate way to interpret $g(\cdot)$ is to define a function $S_h : \mathcal{X}^T \to \mathbb{R}$ as 

$$S_h(x_1, \ldots, x_T) = h(x_1, x_2, \ldots, x_T) + h(x_2, x_1, x_3, x_4, \ldots, x_T) + h(x_3, x_2, x_1, x_4, \ldots, x_T) + \cdots + h(x_T, x_2, x_3, \ldots, x_1) \tag{14}$$

i.e. $S_h$ is the sum of all the swaps from the first coordinate with each of the other coordinates. 
Then $g(x)$ is merely $E_0[S_h(\mathbf{X}_T) \mid X_1 = x]$. In the case that $h$ is symmetric over all its arguments, 
$S_h(\mathbf{X}_T)$ becomes $Th(\mathbf{X}_T)$, and $g(x) = TE_0[h(\mathbf{X}_T) \mid X_1 = x]$. 

2. The function $G(x, y)$ again has a "symmetrization" interpretation, by conditioning on two 
random components in each summand. Moreover, in the formula (11), one can replace 
$G(x,y) = \sum_{t=1}^T \sum_{s=1,\ldots,T,\, s \neq t} G_{ts}(x,y)$ by either 

$$2\sum_{t=1}^T \sum_{s<t} G_{ts}(x,y) \quad \text{or} \quad 2\sum_{t=1}^T \sum_{s>t} G_{ts}(x,y)$$

by simple arithmetic and replacing the roles of $X$ and $Y$ for some summands (note that 
$G_{st}(x,y) = G_{ts}(y,x)$). Furthermore, the formula (11) can be written as 

$$\nu = E_0[G(X,Y)(g(X) - E_0[g(X)])(g(Y) - E_0[g(Y)])]$$

by using the independence of $X$ and $Y$. 

3. The first and second order nonparametric derivatives are translation invariant in both $g(x)$ 
and $G(x, y)$, inherited from the translation invariance of cumulants and the centering for each 
factor in the expression of $\nu$ in (11). 
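
The quantities in Theorem 2 can be computed exactly when the benchmark distribution has finite support, by enumerating all paths of length $T$. The sketch below does this; it is an illustration under that finite-support assumption, and the function names and the example cost function are hypothetical rather than taken from the paper. It also builds $2\sum_t\sum_{s<t} G_{ts}$ so that the replacement described in point 2 can be checked numerically.

```python
import itertools
import math

def symmetrization_terms(h, support, probs, T):
    """Exact E_0[h], g of (8)-(9), G of (12)-(13), and 2*sum_{s<t} G_ts, for finite support."""
    n = len(support)
    mean_h = 0.0
    g = [0.0] * n
    G = [[0.0] * n for _ in range(n)]
    G_lower = [[0.0] * n for _ in range(n)]
    for path in itertools.product(range(n), repeat=T):
        weight = math.prod(probs[i] for i in path)
        value = h([support[i] for i in path])
        mean_h += weight * value
        for t in range(T):
            x = path[t]
            # Dividing by P_0(X_t = x) turns the joint sum into E_0[h | X_t = x].
            g[x] += weight * value / probs[x]
            for s in range(T):
                if s == t:
                    continue
                y = path[s]
                term = weight * value / (probs[x] * probs[y])
                G[x][y] += term
                if s < t:
                    G_lower[x][y] += 2.0 * term
    return mean_h, g, G, G_lower

def nu_term(g, G, probs):
    """nu of (11): centred G(X, Y) times centred g(X) and g(Y), with X, Y independent."""
    n = len(probs)
    mean_g = sum(probs[i] * g[i] for i in range(n))
    mean_G = sum(probs[i] * probs[j] * G[i][j] for i in range(n) for j in range(n))
    return sum(probs[i] * probs[j] * (G[i][j] - mean_G) * (g[i] - mean_g) * (g[j] - mean_g)
               for i in range(n) for j in range(n))

def expansion_coefficients(h, support, probs, T):
    """Benchmark value, first and second order coefficients of (10), and nu."""
    mean_h, g, G, _ = symmetrization_terms(h, support, probs, T)
    n = len(support)
    mean_g = sum(probs[i] * g[i] for i in range(n))
    var_g = sum(probs[i] * (g[i] - mean_g) ** 2 for i in range(n))
    kappa3 = sum(probs[i] * (g[i] - mean_g) ** 3 for i in range(n))  # third central moment
    nu = nu_term(g, G, probs)
    return mean_h, math.sqrt(2.0 * var_g), (kappa3 / 3.0 + nu) / var_g, nu

# Hypothetical example: T = 3 Bernoulli(0.1) components, h = indicator that at least two equal 1.
majority = lambda xs: float(sum(xs) >= 2)
print(expansion_coefficients(majority, [0.0, 1.0], [0.9, 0.1], 3))
```

The enumeration costs $n^T$ evaluations of $h$ for a support of size $n$, so it is only meant for small examples; for general benchmark models the conditional expectations would be estimated by simulation instead.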

Theorem [2] is illustrated by the following example: 

Example 2 (Tail probability of i.i.d. sum). Consider computing the probability $P\left(\sum_{t=1}^T X_t > y\right)$ where $X_t$ are i.i.d. Gaussian variables with mean 0 and variance $\sigma^2$. Suppose there is a change in the mean, within the Gaussian family, from 0 to $\mu \ne 0$. By the likelihood ratio method (see, for example, §7 in [?]), since $(d/d\mu) \log \phi(x; \mu, \sigma^2)|_{\mu=0} = x/\sigma^2$, where $\phi(x; \mu, \sigma^2)$ denotes the density of $N(\mu, \sigma^2)$, the first order parametric derivative with respect to $\mu$ is

$$\frac{d}{d\mu} P\left(\sum_{t=1}^T X_t > y\right)\bigg|_{\mu=0} = E_0\left[I\left(\sum_{t=1}^T X_t > y\right) \sum_{t=1}^T \frac{X_t}{\sigma^2}\right] = \frac{E_0\left[\sum_{t=1}^T X_t;\ \sum_{t=1}^T X_t > y\right]}{\sigma^2} = \frac{E_0[N(0, T\sigma^2);\ N(0, T\sigma^2) > y]}{\sigma^2}.$$

To compare with our nonparametric framework, let us first convert the above calculation into a scale in terms of KL divergence. We know already from Example 1 that the KL divergence between $N(0, \sigma^2)$ and $N(\mu, \sigma^2)$ is $\mu^2/(2\sigma^2)$, which means a unit change in $\mu$ is equivalent to a $1/(\sqrt{2}\sigma)$ change in the square root of KL divergence. So the first order derivative with respect to the square root of KL divergence, along this particular direction, is



$$\frac{\sqrt{2}}{\sigma} E_0[N(0, T\sigma^2);\ N(0, T\sigma^2) > y]. \tag{15}$$

Now let us consider a change in the model beyond the Gaussian family, but still retaining the i.i.d. assumption. Note that Assumption 3 holds trivially since the cost function $I\left(\sum_{t=1}^T X_t > y\right)$ is an indicator function and hence bounded. Next, we have $g(x) = \sum_{t=1}^T P_0\left(\sum_{s=1}^T X_s > y \mid X_t = x\right) = T P_0\left(\sum_{s=1}^T X_s > y \mid X_1 = x\right)$ (alternatively, since $I\left(\sum_{t=1}^T X_t > y\right)$ is symmetric, by discussion point 1 above, $g(x) = T E_0[h(\mathbf{X}_T) \mid X_1 = x]$, giving the same conclusion). The random variable

$$g(X) = T P_0\left(X + \sum_{t=2}^T X_t > y \,\Big|\, X\right) = T \bar\Phi(y - X;\ 0, (T-1)\sigma^2)$$

where $\bar\Phi(\cdot\,; 0, \sigma^2)$ denotes the tail distribution of $N(0, \sigma^2)$ and $X \sim N(0, \sigma^2)$, is clearly non-constant. Consequently, Assumption 4 is satisfied. Moreover, the first order nonparametric derivative for the tail probability is

$$\sqrt{2\mathrm{Var}_0(g(X))} = T\sqrt{2\mathrm{Var}_0(\bar\Phi(y - X;\ 0, (T-1)\sigma^2))}. \tag{16}$$



According to the discussion in Section 3.1, the value of (16) must dominate (15). For example, when $y = 10$, $T = 5$, and $\sigma = 2$, the parametric derivative with respect to the mean, given by (15), is 0.104. On the other hand, numerical estimation of the first order nonparametric derivative shown by (16) is 0.131 (within $10^{-[\ldots]}$ error with 95% confidence, using $[\ldots]$ samples). When $y = 10$, $T = 10$ and $\sigma = 1$, the parametric derivative (15) is 0.012 and the nonparametric derivative is 0.015 (within $10^{-[\ldots]}$ error with 95% confidence).
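The two derivatives in this example can be reproduced numerically. The following is a minimal sketch, not the authors' code: it evaluates (15) in closed form via the Gaussian identity $E[N; N > y] = s\,\phi(y/s)$ for $N \sim N(0, s^2)$, and evaluates (16) by a deterministic trapezoid rule rather than by simulation. The function names, grid size `n` and truncation `zmax` are illustrative assumptions.

```python
import math

def normal_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def normal_tail(x, sd):
    """P(N(0, sd^2) > x), via the complementary error function."""
    return 0.5 * math.erfc(x / (sd * math.sqrt(2.0)))

def parametric_derivative(y, T, sigma):
    """Formula (15): (sqrt(2)/sigma) * E[N; N > y] with N ~ N(0, T sigma^2)."""
    s = sigma * math.sqrt(T)
    return math.sqrt(2.0) / sigma * s * normal_pdf(y / s)

def nonparametric_derivative(y, T, sigma, n=20001, zmax=10.0):
    """Formula (16): T * sqrt(2 Var(tail(y - X))) with X ~ N(0, sigma^2).

    The variance is computed by a trapezoid rule in z = X / sigma.
    """
    sd_rest = sigma * math.sqrt(T - 1)
    dz = 2.0 * zmax / (n - 1)
    m1 = 0.0
    m2 = 0.0
    for i in range(n):
        z = -zmax + i * dz
        weight = normal_pdf(z) * dz * (0.5 if i in (0, n - 1) else 1.0)
        p = normal_tail(y - sigma * z, sd_rest)
        m1 += weight * p
        m2 += weight * p * p
    return T * math.sqrt(2.0 * (m2 - m1 * m1))

for y, T, sigma in [(10.0, 5, 2.0), (10.0, 10, 1.0)]:
    print(y, T, sigma,
          parametric_derivative(y, T, sigma),
          nonparametric_derivative(y, T, sigma))
```

In both parameter settings the nonparametric value exceeds the parametric one, in line with the dominance stated above.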

We caution that although a sum of Gaussian random variables $X_1 + \cdots + X_T$ is still Gaussian, which allows straightforward evaluation of the probability of interest as well as the parametric derivative along the movement in mean, a perturbation of model beyond the Gaussian family for the summand variables will destroy their infinite divisibility property [?]. In this sense there is no shortcut to obtain the formula (16).



Intuitively, in our nonparametric derivatives in Theorem 2, the moments of the sum of sequential conditional expectations, each given a particular time step or pairs of time steps, can be interpreted as a summary of the variation due to all the underlying individual components. The form of the first order nonparametric derivative stipulates that its estimation can suffer from a linear growth of mean square error with the time horizon. This issue resembles that of the likelihood ratio method in classical parametric derivative estimation. For certain performance measures, such as those that require tracking the system's large deviations properties, it is plausible that variance reduction techniques can be employed, as in the parametric case [41]. Moreover, stationarity can potentially play a role in achieving an asymptotic mean square error of the estimators that is sublinear in $T$.

3.3 Random Time Horizon Problems 

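We now extend our result to problems involving a random time $\tau$. Consider the cost function $h(\mathbf{X}_\tau)$ that depends on the sequence $\mathbf{X}_\tau = (X_1, X_2, \ldots, X_\tau)$. Our formulation is then

$$\begin{array}{ll} \max & E_f[h(\mathbf{X}_\tau)] \\ \text{subject to} & D(P_f \| P_0) \le \eta \\ & P_f \in \mathcal{P}_0 \end{array} \tag{17}$$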

Let us lay out our assumptions on $h$ and the random time $\tau$. We impose either a boundedness or an independence condition on $\tau$:

Assumption 5. The random time $\tau$ is a stopping time with respect to $\{\mathcal{F}_t\}_{t \ge 1}$, a filtration that supersets the filtration generated by the sequence $\{X_t\}_{t \ge 1}$, namely $\{\mathcal{F}(X_1, \ldots, X_t)\}_{t \ge 1}$. Moreover, $\tau$ is bounded a.s. by a deterministic time $T$. The cost function $h$ satisfies $|h(\mathbf{X}_\tau)| \le \sum_{t=1}^T \Lambda_t(X_t)$ a.s. for some deterministic functions $\Lambda_t(\cdot)$, where $\Lambda_t(X_t)$ each possesses finite exponential moment, i.e. $E_0[e^{\theta \Lambda_t(X)}] < \infty$, for $\theta$ in a neighborhood of zero.

Assumption 6. The random time $\tau$ is independent of the sequence $\{X_t\}_{t \ge 1}$, and has finite second moment under $P_0$, i.e. $E_0 \tau^2 < \infty$. Moreover, the cost function $h(\mathbf{X}_\tau)$ is bounded a.s.

Next, we also place a non-degeneracy condition analogous to Assumption 4. With $\tau$ satisfying either of the assumptions above, we define $\tilde g : \mathcal{X} \to \mathbb{R}$ as

$$\tilde g(x) = \sum_{t=1}^\infty \tilde g_t(x) \tag{18}$$

where $\tilde g_t(x)$ is given by

$$\tilde g_t(x) = E_0[h(\mathbf{X}_\tau);\ \tau \ge t \mid X_t = x]. \tag{19}$$

Our non-degeneracy condition is now imposed on the function $\tilde g$ acted on $X$:

Assumption 7. The random variable $\tilde g(X)$ is non-degenerate, i.e. non-constant, under the benchmark distribution $P_0$.

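Either one of Assumption 5 or 6 suffices to extend our scheme in Section 3.2 to random time horizon. We have the following theorem:

Theorem 3. Suppose either Assumption 5 or 6 holds, together with Assumption 7. Then the optimal value of (17) satisfies

$$E_{f^*}[h(\mathbf{X}_\tau)] = E_0[h(\mathbf{X}_\tau)] + \sqrt{2\mathrm{Var}_0(\tilde g(X))}\,\eta^{1/2} + \frac{1}{\mathrm{Var}_0(\tilde g(X))}\left(\frac{1}{3}\kappa_3(\tilde g(X)) + \nu\right)\eta + O(\eta^{3/2}) \tag{20}$$

where $\kappa_3(\cdot)$ denotes the third cumulant and

$$\nu = E_0[(G(X,Y) - E_0[G(X,Y)])(\tilde g(X) - E_0[\tilde g(X)])(\tilde g(Y) - E_0[\tilde g(Y)])]. \tag{21}$$

Here $\tilde g(x)$ is defined in (18), and $G(x,y)$ is defined as $G(x,y) = \sum_{t \ge 1} \sum_{s \ge 1,\, s \ne t} G_{ts}(x,y)$, where $G_{ts}(x,y)$ is given by

$$G_{ts}(x,y) = E_0[h(\mathbf{X}_\tau);\ \tau \ge t \wedge s \mid X_t = x, X_s = y].$$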



To get some intuition, note that the function $g(x)$ introduced in Theorem 2 in the last subsection can be expressed as

$$g(x) = \sum_{t=1}^T E_0[h(\mathbf{X}_T) \mid X_t = x] = \sum_{t=1}^\infty E_0[h(\mathbf{X}_T) \mid X_t = x]\, I(T \ge t) = \sum_{t=1}^\infty E_0[h(\mathbf{X}_T);\ T \ge t \mid X_t = x]. \tag{22}$$

Of course, this calculation is only valid when $T$ is deterministic. Theorem 3, however, states that we can replace $T$ in (22) by a random time $\tau$ under the proper assumptions.

Similar to Theorem 2, the function $G(x,y)$ in (21) can be replaced by

$$2\sum_{t=1}^\infty \sum_{s<t} E_0[h(\mathbf{X}_\tau);\ \tau \ge s \mid X_t = x, X_s = y] \quad \text{or} \quad 2\sum_{t=1}^\infty \sum_{s>t} E_0[h(\mathbf{X}_\tau);\ \tau \ge t \mid X_t = x, X_s = y].$$

Regarding computation, especially using Monte Carlo methods, the fact that $\tilde g(x)$ and $G(x, y)$ involve infinite sums may cause some concern. One way around this is to express the sums as expectations over infinitely supported random variables along with appropriate weighting. For example, in the case of $\tilde g(x)$, we can introduce a random time $R$, independent of the system, with probability mass function $p_w$, $w = 1, 2, \ldots$. Then

$$\tilde g(x) = E[p_R^{-1} h(\mathbf{X}_\tau^R);\ \tau \ge R \mid X = x]$$

where $\mathbf{X}_\tau^R$ is $\mathbf{X}_\tau$ with the $R$-th component replaced by $X = x$, and the expectation is over both $R$ and the system. In other words, one can simulate $\tilde g(x)$ by the following procedure (with $x$ fixed):

1. Sample the auxiliary random time $R$; say $R = w$ is the realization.

2. Sample $X_1, \ldots, X_\tau$, with $X_w$ fixed as $x$.

3. If $\tau < w$, output 0; else output $p_w^{-1} h(\mathbf{X}_\tau)$.

This yields a sample of $p_R^{-1} h(\mathbf{X}_\tau^R) I(\tau \ge R)$ given $X = x$. The choice of the random time $R$
can be quite flexible, as long as the involved expectation (and also second moment for the sake 
of consistency) is finite. This time horizon randomization technique has been used in devising 
importance samplers for first passage and steady-state problems in queueing networks and Gaussian 
processes (see, for example, [HI [8]). 
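The three-step procedure above can be sketched in a toy setting. Everything specific here is an assumption made for illustration, not part of the original: the $X_t$ are i.i.d. Bernoulli(1/2), $\tau$ is geometric with parameter `q_tau` and independent of the $X_t$ (as in Assumption 6), the auxiliary time $R$ is geometric with parameter `r_aux`, and $h$ is the running maximum (bounded by 1). In this toy case $\tilde g(1) = \sum_t P_0(\tau \ge t) = E_0[\tau] = 1/q_\tau$, which gives a reference value; choosing `r_aux < q_tau` keeps the second moment of the estimator finite.

```python
import math
import random

def sample_geometric(rng, p):
    """Draw from the geometric distribution on {1, 2, ...} with parameter p."""
    return int(math.log(1.0 - rng.random()) / math.log(1.0 - p)) + 1

def estimate_g_tilde(x, h, q_tau=0.5, r_aux=0.3, n=50000, seed=0):
    """Monte Carlo estimate of g~(x) by randomizing the time index (toy model)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        w = sample_geometric(rng, r_aux)           # step 1: auxiliary time R = w
        p_w = r_aux * (1.0 - r_aux) ** (w - 1)     # probability mass of R at w
        tau = sample_geometric(rng, q_tau)         # random horizon, independent of X
        if tau < w:
            continue                               # step 3: output 0
        path = [rng.randint(0, 1) for _ in range(tau)]  # step 2: X_1, ..., X_tau
        path[w - 1] = x                            # ... with X_w fixed at x
        total += h(path) / p_w                     # step 3: output h / p_w
    return total / n

estimate = estimate_g_tilde(1, max)
print(estimate)  # should be close to E_0[tau] = 2 in this toy model
```

A heavier-tailed $R$ lowers the weights $p_w^{-1}$ at large $w$ at the cost of sampling longer horizons more often, which is the flexibility the text refers to.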



3.4 Auxiliary Variables 

In some applications, the cost function can involve many different random sources, and one is interested in evaluating the model misspecification effect from one particular source. Our framework discussed in the last three subsections can be easily adapted to such scenarios. Suppose there is a random object $Y$ in the system, potentially dependent on $\{X_t\}_{t \ge 1}$, and we are interested in the model misspecification effect of the distribution of $X$ for the performance measure $E_0[h(\mathbf{X}, Y)]$. This random object $Y$ can be a sequence of random variables, say $Y_1, Y_2, \ldots$, or can be in other plausible forms. All the results discussed in the previous subsections, namely Theorems 1, 2 and 3, still hold, but with $h(\mathbf{X})$ replaced by $E_0[h(\mathbf{X}, Y) \mid \mathbf{X}]$ (part of the statements in the theorems can then be further simplified). Such a direct replacement is enough because one can consider $E_f[E_0[h(\mathbf{X}, Y) \mid \mathbf{X}]]$ as the performance measure and $E_0[h(\mathbf{X}, Y) \mid \mathbf{X}]$ as the cost function, noting that $E_0$ here involves only the random object $Y$, which is not subject to perturbation.



4 Explanation of the Preliminary Results 

In this and the following section, we will explain the main methodology in arriving at the theorems in Section 3. This section will be devoted to the basic case, namely Theorem 1, which will also serve as an important building block for the other results.

To solve the maximization problem in Theorem 1, we first transform the decision variables from the space of measures to the space of functions, which is easier to handle. Recall that $P_f$ is assumed to be absolutely continuous with respect to $P_0$, and hence the likelihood ratio $L := dP_f/dP_0$ exists. Via a change of measure, the optimization problem can be rewritten as a maximization over the likelihood ratios, i.e. it is equivalent to

$$\begin{array}{ll} \max & E_0[h(X)L(X)] \\ \text{subject to} & E_0[L(X)\log L(X)] \le \eta \\ & L \in \mathcal{L} \end{array} \tag{23}$$

where $\mathcal{L} := \{L \in \mathcal{L}_1(P_0) : E_0[L] = 1,\ L \ge 0 \text{ a.s.}\}$, and we denote $\mathcal{L}_1(P_0)$ as the $L_1$-space with respect to the measure $P_0$ (we sometimes suppress the dependence on $X$ in $L = L(X)$ for convenience when no confusion arises). The key then is to find an optimal solution $L^*$, and investigate its asymptotic relation with $\eta$.

To solve (23), we consider the Lagrangian relaxation

$$\min_{\alpha \ge 0} \max_{L \in \mathcal{L}} E_0[h(X)L] - \alpha(E_0[L \log L] - \eta). \tag{24}$$

Since the constraint is convex and the objective is linear, strong duality holds [39], and the optimal values of (23) and (24) coincide. Consider the inner maximization of (24) given by

$$\max_{L \in \mathcal{L}} E_0[h(X)L - \alpha L \log L]. \tag{25}$$

The solution for this inner maximization is characterized by the following proposition:

Proposition 1. Suppose Assumption 1 holds, and consider $\alpha > 0$. When $\alpha$ is sufficiently large, there exists a unique optimizer of (25) given by

$$L^*(x) = \frac{e^{h(x)/\alpha}}{E_0[e^{h(X)/\alpha}]}. \tag{26}$$



This result is known in robust single-stage control [27]. Let us review it briefly here. A heuristic for getting (26) is to apply the Euler-Lagrange equation and informally differentiate the integrand with respect to $L$. First, the Lagrangian relaxation of the constraint $E_0[L] = 1$ is

$$E_0[h(X)L - \alpha L \log L + \lambda L - \lambda]$$

where $\lambda \in \mathbb{R}$ is the Lagrange multiplier. Treating $E_0[\cdot]$ as an integral, the Euler-Lagrange equation implies that the derivative with respect to $L$ is

$$h(X) - \alpha \log L - \alpha + \lambda = 0$$

which gives

$$\log L = \frac{h(X)}{\alpha} + \frac{\lambda - \alpha}{\alpha}$$

or that $L = \lambda' e^{h(X)/\alpha}$ for some $\lambda' > 0$. With the constraint that $E_0[L] = 1$, a candidate solution is

$$L^*(x) = \frac{e^{h(x)/\alpha}}{E_0[e^{h(X)/\alpha}]}. \tag{27}$$

The above argument is heuristic. To verify (27) formally, the following convexity argument will suffice. First, note that the objective value of (25) evaluated at $L^*$, given by (27), is

$$E_0[h(X)L^* - \alpha L^* \log L^*] = E_0\left[h(X)L^* - \alpha L^*\left(\frac{h(X)}{\alpha} - \log E_0[e^{h(X)/\alpha}]\right)\right] = \alpha \log E_0[e^{h(X)/\alpha}]. \tag{28}$$

Our goal is to show that

$$\alpha \log E_0[e^{h(X)/\alpha}] \ge E_0[h(X)L - \alpha L \log L]$$

for all $L \in \mathcal{L}$. Rearranging terms, this means we need

$$E_0[e^{h(X)/\alpha}] \ge e^{E_0[h(X)L - \alpha L \log L]/\alpha}. \tag{29}$$

To prove (29), observe that, for any likelihood ratio $L$,

$$E_0[e^{h(X)/\alpha}] \ge E_0[L e^{h(X)/\alpha - \log L}] \ge e^{E_0[h(X)L/\alpha - L \log L]}$$

by using the convexity of the function $e^{x}$ and Jensen's inequality over the expectation $E_0[L\,\cdot]$ in the last inequality. Note that equality holds if and only if $h(X)/\alpha - \log L$ is degenerate, i.e. $h(X)/\alpha - \log L = \text{constant}$, which reduces to $L^*$. Hence uniqueness also holds.

In conclusion, when $1/\alpha \in \mathcal{D}^+ := \{\theta \in \mathbb{R}_+ \setminus \{0\} : \psi(\theta) < \infty\}$, the optimal solution of (25) is given by (26), with the optimal value $\alpha \log E_0[e^{h(X)/\alpha}]$. This fact will be used in the proof of Theorem 1.
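As a hedged illustration that is not part of the original development, consider the special case $h(X) = X$ with $X \sim N(0, \sigma^2)$ under $P_0$ (an assumed toy model). Here every quantity in Proposition 1 is explicit, and the optimal change of measure is a pure mean shift:

```latex
\begin{align*}
\psi(\theta) &= \log E_0[e^{\theta X}] = \tfrac{1}{2}\sigma^2\theta^2 < \infty
  \quad \text{for all } \theta, \\
L^*(x) &= \frac{e^{x/\alpha}}{E_0[e^{X/\alpha}]}
  = \exp\!\Big(\frac{x}{\alpha} - \frac{\sigma^2}{2\alpha^2}\Big), \\
X &\sim N\!\big(\sigma^2/\alpha,\ \sigma^2\big)
  \quad \text{under } dP_{f^*} = L^*\, dP_0, \\
E_0[L^* \log L^*] &= \frac{\sigma^2}{\alpha^2} - \frac{\sigma^2}{2\alpha^2}
  = \frac{\sigma^2}{2\alpha^2}, \\
E_0[h(X)L^* - \alpha L^* \log L^*] &= \frac{\sigma^2}{\alpha} - \frac{\sigma^2}{2\alpha}
  = \frac{\sigma^2}{2\alpha} = \alpha \log E_0[e^{X/\alpha}].
\end{align*}
```

The last line agrees with the optimal value $\alpha \log E_0[e^{h(X)/\alpha}]$ stated above.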



Proof of Theorem 1. We want to express the optimal value of (24) in terms of $\eta$. Let $\alpha^*$ be the optimal dual solution. Our proof scheme is divided into the following two steps:

Relation between $\eta$ and $\alpha^*$. By complementary slackness, either $\alpha^* = 0$ or, in the case $\alpha^* > 0$, $E_0[L^* \log L^*] = \eta$. Let us assume, for the moment, that the latter case holds, i.e. $\alpha^* > 0$. Moreover, we assume that $\alpha^*$ is sufficiently large, namely that $1/\alpha^* \in \mathcal{D}^+$. We will exclude the other possibilities on $\alpha^*$ at the end of this proof. Now we have

$$\eta = E_0[L^* \log L^*] = \frac{E_0[h(X)e^{h(X)/\alpha^*}]}{\alpha^* E_0[e^{h(X)/\alpha^*}]} - \log E_0[e^{h(X)/\alpha^*}] = \beta\psi'(\beta) - \psi(\beta) \tag{30}$$

where $\beta = 1/\alpha^*$, and $\psi(\beta) = \log E_0[e^{\beta h(X)}]$ is the logarithmic moment generating function of $h(X)$.
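Relation between the optimal primal objective value and $\alpha^*$. The primal objective is

$$E_0[h(X)L^*] = \frac{E_0[h(X)e^{h(X)/\alpha^*}]}{E_0[e^{h(X)/\alpha^*}]} = \frac{E_0[h(X)e^{\beta h(X)}]}{E_0[e^{\beta h(X)}]} = \psi'(\beta). \tag{31}$$

This gives the first part of the theorem. It is then a simple application of Taylor's expansion to arrive at the asymptotic relation in (6). First, note that the relation (30) can be written as

$$\eta = \beta\psi'(\beta) - \psi(\beta) = \sum_{n=0}^\infty \frac{\kappa_{n+1}}{n!}\beta^{n+1} - \sum_{n=0}^\infty \frac{\kappa_n}{n!}\beta^n = \sum_{n=1}^\infty \left[\frac{1}{(n-1)!} - \frac{1}{n!}\right]\kappa_n\beta^n = \sum_{n=2}^\infty \frac{1}{n(n-2)!}\kappa_n\beta^n \tag{32}$$

where $\kappa_n = \psi^{(n)}(0)$ is the $n$-th cumulant of $h(X)$. By Assumptions 1 and 2, $\psi(\cdot)$ is strictly convex, so $\beta\psi'(\beta) - \psi(\beta)$ has a unique zero at $\beta = 0$. Consequently, we can invert the relation (32) to get

$$\beta = \sqrt{\frac{2}{\kappa_2}}\,\eta^{1/2} - \frac{2}{3}\frac{\kappa_3}{\kappa_2^2}\,\eta + O(\eta^{3/2}).$$

As a result, (31) can be expanded as

$$\begin{aligned} E_0[h(X)L^*] = \psi'(\beta) &= \kappa_1 + \kappa_2\beta + \kappa_3\frac{\beta^2}{2} + O(\beta^3) \\ &= \kappa_1 + \sqrt{2\kappa_2}\,\eta^{1/2} - \frac{2}{3}\frac{\kappa_3}{\kappa_2}\,\eta + \frac{\kappa_3}{2}\left(\frac{2}{\kappa_2}\right)\eta + O(\eta^{3/2}) \\ &= \kappa_1 + \sqrt{2\kappa_2}\,\eta^{1/2} + \frac{1}{3}\frac{\kappa_3}{\kappa_2}\,\eta + O(\eta^{3/2}) \end{aligned}$$

which gives (6).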
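The expansion can be checked numerically on a small example. The sketch below is an illustration under assumed inputs: a hypothetical three-point distribution for $h(X)$ and a small $\eta$. It solves $\beta\psi'(\beta) - \psi(\beta) = \eta$ by bisection, evaluates the exact worst-case mean $\psi'(\beta)$, and compares it with the expansion $\kappa_1 + \sqrt{2\kappa_2}\,\eta^{1/2} + \frac{1}{3}(\kappa_3/\kappa_2)\eta$ obtained from (30) and (31).

```python
import math

values = [0.0, 1.0, 3.0]   # hypothetical support of h(X)
probs = [0.5, 0.3, 0.2]    # hypothetical benchmark probabilities under P_0

def psi(beta):
    """Logarithmic moment generating function of h(X)."""
    return math.log(sum(p * math.exp(beta * v) for p, v in zip(probs, values)))

def psi_prime(beta):
    """Derivative of psi, i.e. the mean of h(X) under exponential tilting."""
    num = sum(p * v * math.exp(beta * v) for p, v in zip(probs, values))
    den = sum(p * math.exp(beta * v) for p, v in zip(probs, values))
    return num / den

def worst_case_mean(eta):
    """Solve beta * psi'(beta) - psi(beta) = eta on [0, 1], return psi'(beta).

    The bracket [0, 1] is an assumption that holds for small eta here.
    """
    lo, hi = 0.0, 1.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if mid * psi_prime(mid) - psi(mid) < eta:
            lo = mid
        else:
            hi = mid
    return psi_prime(0.5 * (lo + hi))

mean = sum(p * v for p, v in zip(probs, values))
k2 = sum(p * (v - mean) ** 2 for p, v in zip(probs, values))  # second cumulant
k3 = sum(p * (v - mean) ** 3 for p, v in zip(probs, values))  # third cumulant

def expansion(eta, order=2):
    """First- or second-order asymptotic expansion of the worst-case mean."""
    first = mean + math.sqrt(2.0 * k2 * eta)
    return first if order == 1 else first + k3 / (3.0 * k2) * eta

eta = 1e-4
print(worst_case_mean(eta), expansion(eta, order=1), expansion(eta, order=2))
```

For small $\eta$ the second-order expansion should sit noticeably closer to the exact value than the first-order one, reflecting the $O(\eta^{3/2})$ remainder.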

To conclude the theorem, we need to argue that $\alpha^*$ is indeed positive and is sufficiently large, namely that $1/\alpha^* \in \mathcal{D}^+$. Since the objective of the dual problem, i.e. the objective in the outer minimization in (24), is convex, we need only show that our choice of $\alpha^*$ that satisfies (30) is a local minimum. To this end, note that any $\alpha$ in a neighborhood of $\alpha^*$ has dual objective function $\alpha\psi(1/\alpha) + \alpha\eta$ (which can be seen by, for example, the calculation in (28)). The derivative of this objective function, $(d/d\alpha)(\alpha\psi(1/\alpha) + \alpha\eta) = \psi(1/\alpha) - (1/\alpha)\psi'(1/\alpha) + \eta$, is equal to 0 at $\alpha^*$. Moreover, it is negative before and positive after $\alpha^*$ in its immediate neighborhood, by a simple analysis using the strict convexity of $\psi$. This concludes that $\alpha^*$ chosen in (30) must be the unique optimal dual solution.

□ 

The simple convexity argument to obtain $L^*$ in Proposition 1 no longer suffices for multiple time horizon. In that situation, the structure of the optimal change of measure will not be a simple exponential tilting, but instead will be characterized as the fixed point of a contraction map on the $\mathcal{L}_1$-space. The next section will focus on this development.

5 Main Methodology

Our goal in this section is to lay out the methodology in proving the result for general time horizon $T > 1$, namely Theorem 2. The cost function $h(\cdot, \ldots, \cdot)$ now depends on an i.i.d. sequence $\{X_t\}_{t=1,\ldots,T}$, each with benchmark distribution $P_0$. Leveraging the idea in the previous section, we first write the maximization problem (1) in terms of the likelihood ratio $L$:

$$\begin{array}{ll} \max & E_0[h(\mathbf{X}_T)L_T] \\ \text{subject to} & E_0[L(X)\log L(X)] \le \eta \\ & L \in \mathcal{L} \end{array} \tag{33}$$

where for convenience we denote $L_T = \prod_{t=1}^T L(X_t)$, and $X$ as a generic variable that is independent of $\{X_t\}_{t=1,\ldots,T}$ and has the same distribution as each $X_t$. The following is the recipe from the last section that we will follow:



1. Consider the Lagrangian relaxation of (33), and characterize the solution for the inner maximization.

2. Argue that the Lagrangian relaxation gives the optimal value for the primal objective in (33).

3. Expand the solution in the dual variable $\alpha^*$ and hence $\eta$.

For $T > 1$, the main technical challenge is the product form $L_T$ that appears in the objective function in (33). This hinders a direct convexity argument in finding an optimal change of measure $L^*$. To get around this, we shall introduce a fixed point equation to find a candidate solution, and use an induction argument to establish optimality. The following subsections will implement these step-by-step.

5.1 Optimal Change of Measure via Fixed Point Equation 



Before we start our analysis, let us first discuss some features of the optimization problem (33). As a comparison, suppose that we treat $\mathbf{X}_T$ as a single variable, and consider a deviation of KL divergence within $T\eta$ units. Then the optimization formulation following (23) in Section 4 will give

$$\begin{array}{ll} \max & E_0[h(\mathbf{X}_T)L(\mathbf{X}_T)] \\ \text{subject to} & E_0[L(\mathbf{X}_T)\log L(\mathbf{X}_T)] \le T\eta \\ & L \in \mathcal{L}^T \end{array} \tag{34}$$

where $\mathcal{L}^T$ is now defined on the $T$-product measure of $P_0$. This primitive formulation (34) provides an upper bound on the optimal value of (33). It can be seen by writing the constraint in (33) as $E_0[L_T \log L_T] \le T\eta$, since the $X_t$'s are i.i.d., and then relaxing the implicit constraint that $L_T = \prod_{t=1}^T L(X_t)$.



This observation also implies that (33) has a bounded optimal value, i.e. it is bounded by the optimal value of (34). The latter is finite because Assumption 3 implies that

$$E_0[e^{\theta h(\mathbf{X}_T)}] \le E_0[e^{\theta\sum_{t=1}^T \Lambda_t(X_t)}] = \prod_{t=1}^T E_0[e^{\theta\Lambda_t(X_t)}] < \infty \tag{35}$$

when $\theta$ is small enough. Hence $h(\mathbf{X}_T)$ has exponential moment and Assumption 1 is satisfied. It is also trivial to check that Assumption 4 implies Assumption 2. Therefore, the result in Theorem 1 applies to (34).

Now, focusing back on the formulation (33), we shall proceed by first reducing the size of the space in which $L$ lies in the formulation. We will soon see that this is useful for constructing an associated contraction map. Moreover, we shall define a metric for the contraction to facilitate our later development of monotonicity properties in reaching the involved objective values. Let $\Lambda(x) = \sum_{t=1}^T \Lambda_t(x)$ where $\Lambda_t(x)$ is defined in Assumption 3. Define

$$\mathcal{L}(M) = \{L \in \mathcal{L} : E_0[\Lambda(X)L(X)] \le M\}$$

for $M > 0$, and the associated norm $\|L\|_\Lambda := E_0[(1 + \Lambda(X))L(X)]$ and hence metric

$$\|L - L'\|_\Lambda = E_0[(1 + \Lambda(X))|L(X) - L'(X)|]. \tag{36}$$

It is routine to check that $\mathcal{L}(M)$ is complete. The following observation is useful for our development:



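Lemma 1. For any $\eta \le N$ for some small $N$, the formulation (33) is equivalent to

$$\begin{array}{ll} \max & E_0[h(\mathbf{X}_T)L_T] \\ \text{subject to} & E_0[L(X)\log L(X)] \le \eta \\ & L \in \mathcal{L}(M) \end{array} \tag{37}$$

for some large enough $M > 0$, independent of $\eta$.

Proof. As in the rest of this paper, we shall suppress the dependence on $X$ in $L = L(X)$ whenever no confusion arises. We want to show that $E_0[L \log L] \le \eta$ and $L \in \mathcal{L}$ together imply $L \in \mathcal{L}(M)$. Note that $\Lambda(X)$ has exponential moment, since Hölder's inequality implies

$$E_0[e^{\theta\Lambda(X)}] = E_0[e^{\theta\sum_{t=1}^T \Lambda_t(X)}] \le \prod_{t=1}^T \left(E_0[e^{\theta T\Lambda_t(X)}]\right)^{1/T} < \infty \tag{38}$$

when $\theta$ is small enough. Hence, for any $L \in \mathcal{L}$ that satisfies $E_0[L \log L] \le \eta$, we have

$$E_0[e^{\theta\Lambda(X)}] = E_0[L L^{-1} e^{\theta\Lambda(X)}] = E_0[L e^{\theta\Lambda(X) - \log L}] < \infty$$

for small enough $\theta$, by (38). Jensen's inequality implies that

$$E_0[L e^{\theta\Lambda(X) - \log L}] \ge e^{\theta E_0[\Lambda(X)L] - E_0[L \log L]}.$$

Since $E_0[L \log L] \le \eta \le N$, we have $E_0[\Lambda(X)L] \le M$ for some constant $M$. So $L \in \mathcal{L}(M)$. This concludes the lemma. □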



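The Lagrangian relaxation of (37) is given by

$$\min_{\alpha \ge 0} \max_{L \in \mathcal{L}(M)} E_0[h(\mathbf{X}_T)L_T] - \alpha(E_0[L \log L] - \eta) \tag{39}$$

with inner maximization

$$\max_{L \in \mathcal{L}(M)} E_0[h(\mathbf{X}_T)L_T] - \alpha E_0[L \log L]. \tag{40}$$

Note that the first expectation $E_0[\cdot]$ in (40) is acted on the product space of $\{X_t\}_{t=1,\ldots,T}$.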



To get some intuition on how to solve (40), note that when $h$ is separable and symmetric, i.e. $h(\mathbf{X}_T) = \sum_{t=1}^T w(X_t)$ for some $w(\cdot)$, the quantity $E_0[h(\mathbf{X}_T)L_T]$ is also separable as $E_0[h(\mathbf{X}_T)L_T] = \sum_{t=1}^T E_0[w(X_t)L(X_t)] = T E_0[w(X)L(X)]$. So the solution for (40) reduces to Proposition 1, but with $\alpha$ replaced by $\alpha/T$. The challenge is the general case when such separation for $h$ does not hold. In this case, one useful way to view $E_0[h(\mathbf{X}_T)L_T]$, as a mimic of the aforementioned separability, is through writing

$$T E_0[h(\mathbf{X}_T)L_T] = \sum_{t=1}^T E_0[g_t^L(X_t)L(X_t)] = E_0[g^L(X)L(X)]$$

where $g_t^L(x) = E_0\left[h(\mathbf{X}_T)\prod_{1 \le r \le T,\, r \ne t} L(X_r) \,\Big|\, X_t = x\right]$ and $g^L(x) = \sum_{t=1}^T g_t^L(x)$. The functions $g_t^L$ and $g^L$ will appear frequently in the sequel. Note that $g^L$ now depends on $L$, which differs from the symmetric separable case.

The analog of Proposition 1 for time horizon $T > 1$ is:

Proposition 2. Suppose Assumption 3 holds. When $\alpha > 0$ is large enough, the unique optimal solution of (40) satisfies

$$L(x) = \frac{e^{g^L(x)/\alpha}}{E_0[e^{g^L(X)/\alpha}]} \tag{41}$$

where $g^L(x) = \sum_{t=1}^T g_t^L(x)$ and $g_t^L(x) = E_0\left[h(\mathbf{X}_T)\prod_{1 \le r \le T,\, r \ne t} L(X_r) \,\Big|\, X_t = x\right]$.

Note that the function $g^L(x)$ can be viewed as the expectation of the function $S_h$ defined in (14), under the shifted measure $P_f$. In other words,

$$g^L(x) = \sum_{t=1}^T E_f[h(\mathbf{X}_T) \mid X_t = x] = E_f[S_h(\mathbf{X}_T) \mid X_1 = x].$$

As in Section 4, the form in (41) can be guessed from a heuristic differentiation with respect to $L$. Consider the Lagrangian relaxation of the constraint $E_0[L] = 1$ in (40):

$$E_0[h(\mathbf{X}_T)L(X_1)L(X_2)\cdots L(X_T)] - \alpha E_0[L \log L] + \lambda E_0[L] - \lambda \tag{42}$$

There are $T$ factors of $L$ in the first term. A heuristic "product rule" is to differentiate each $L$ factor, keeping all other $L$'s unchanged. To do so, we condition on $X_t$ to write



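$$E_0[h(\mathbf{X}_T)L(X_1)L(X_2)\cdots L(X_T)] = E_0\left[E_0\left[h(\mathbf{X}_T)\prod_{1 \le r \le T,\, r \ne t} L(X_r) \,\Big|\, X_t\right]L(X_t)\right]$$

and it turns out that the "right" differentiation of $L$ on this quantity is represented as a sum of conditional expectations

$$\frac{d}{dL(x)} E_0\left[h(\mathbf{X}_T)\prod_{t=1}^T L(X_t)\right] = \sum_{t=1}^T E_0\left[h(\mathbf{X}_T)\prod_{1 \le r \le T,\, r \ne t} L(X_r) \,\Big|\, X_t = x\right]. \tag{43}$$

So the "product rule" gives an Euler-Lagrange equation as

$$\sum_{t=1}^T E_0\left[h(\mathbf{X}_T)\prod_{1 \le r \le T,\, r \ne t} L(X_r) \,\Big|\, X_t = x\right] - \alpha \log L(x) - \alpha + \lambda = 0 \tag{44}$$

which gives

$$L(x) \propto \exp\left\{\frac{1}{\alpha}\sum_{t=1}^T E_0\left[h(\mathbf{X}_T)\prod_{1 \le r \le T,\, r \ne t} L(X_r) \,\Big|\, X_t = x\right]\right\}.$$

The constraint $E_0[L] = 1$ then enforces the expression (41). To gain more understanding of the "product rule" (43), let us illustrate with a finitely supported $X$: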



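Example 3. Consider two i.i.d. random variables $X_1$ and $X_2$, and a cost function $h(X_1, X_2)$. The variables $X_1$ and $X_2$ have finite support on $1, 2, \ldots, n$ under $P_0$. Denote $p(x) = P_0(X_1 = x)$ for $x = 1, 2, \ldots, n$. The objective value in (33) in this case is

$$E_0[h(X_1, X_2)L(X_1)L(X_2)] = \sum_{x_1=1}^n \sum_{x_2=1}^n h(x_1, x_2)p(x_1)p(x_2)L(x_1)L(x_2).$$

Now differentiate with respect to each $L(1), L(2), \ldots, L(n)$ respectively. For $i = 1, \ldots, n$, we have

$$\begin{aligned} \frac{d}{dL(i)} E_0[h(X_1, X_2)L(X_1)L(X_2)] &= \sum_{x_1 \ne i} h(x_1, i)p(x_1)p(i)L(x_1) + \sum_{x_2 \ne i} h(i, x_2)p(i)p(x_2)L(x_2) + 2h(i, i)p(i)^2 L(i) \\ &= p(i)\left(E_0[h(X_1, i)L(X_1)] + E_0[h(i, X_2)L(X_2)]\right). \end{aligned}$$

After dividing by the weight $p(i)$, which corresponds to treating $E_0[\cdot]$ as an integral against $P_0$, this coincides with the product rule (43) discussed above.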
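The finite-support calculation can be verified with a finite-difference check. The sketch below uses assumed illustrative inputs (the vectors `p`, `L` and the matrix `h` are hypothetical, and indices run from 0 rather than 1): it differentiates the double sum numerically in each $L(i)$ and compares the result, divided by the weight $p(i)$, with the sum of the two conditional expectations.

```python
p = [0.2, 0.5, 0.3]            # hypothetical benchmark probabilities
h = [[0.0, 1.0, 2.0],          # hypothetical cost h(x1, x2), not symmetric
     [0.5, 0.0, 1.0],
     [2.0, 1.0, 0.5]]
L = [1.5, 0.8, 1.0]            # a likelihood ratio with sum(p * L) = 1
n = len(p)

def objective(L):
    """E_0[h(X1, X2) L(X1) L(X2)] as a double sum over the support."""
    return sum(h[a][b] * p[a] * p[b] * L[a] * L[b]
               for a in range(n) for b in range(n))

def product_rule(i, L):
    """E_0[h(X1, i) L(X1)] + E_0[h(i, X2) L(X2)]."""
    first = sum(h[a][i] * p[a] * L[a] for a in range(n))
    second = sum(h[i][b] * p[b] * L[b] for b in range(n))
    return first + second

def finite_difference(i, L, eps=1e-6):
    """Central difference of the objective in the i-th coordinate of L."""
    up = list(L)
    up[i] += eps
    down = list(L)
    down[i] -= eps
    return (objective(up) - objective(down)) / (2.0 * eps)

for i in range(n):
    print(i, finite_difference(i, L) / p[i], product_rule(i, L))
```

Because the objective is quadratic in $L$, the central difference is exact up to rounding, so the two printed columns should agree to many digits.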



We shall justify the above heuristic rigorously. The case for T = 2 will be presented in the next 
subsection. We will then outline the general case, whose idea is similar but involves a few extra 
technical constructions. The full proof of the general case will be deferred to the appendix. 

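5.2 Proof of Proposition 2 for T = 2

In the case $T = 2$, $g^L(x) = E_0[h(X_1, X_2)L(X_2) \mid X_1 = x] + E_0[h(X_1, X_2)L(X_1) \mid X_2 = x] = E_0[S_h(X_1, X_2)L(X_2) \mid X_1 = x]$ where $S_h$ is defined in (14). The proof of Proposition 2 centers around the operator $\mathcal{K} := \mathcal{K}(h, \alpha)(\cdot) : \mathcal{L}(M) \to \mathcal{L}(M)$, defined as

$$\mathcal{K}(L)(x) = \frac{e^{E_0[S_h(X_1, X_2)L(X_2) \mid X_1 = x]/\alpha}}{E_0[e^{E_0[S_h(X_1, X_2)L(X_2) \mid X_1]/\alpha}]}. \tag{45}$$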
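On a finite support the operator in (45) is a simple vector map, and iterating it gives a practical way to approximate its fixed point. The following sketch is illustrative only: the support size, the probabilities `p`, the cost matrix `h` and the value `alpha = 20.0` are all assumptions chosen so that $\alpha$ is large relative to the size of $h$.

```python
import math

p = [0.2, 0.5, 0.3]            # hypothetical benchmark probabilities
h = [[0.0, 1.0, 2.0],          # hypothetical cost h(x1, x2)
     [0.5, 0.0, 1.0],
     [2.0, 1.0, 0.5]]
n = len(p)
alpha = 20.0                   # assumed "large enough" dual variable

def S_h(a, b):
    """Swap symmetrization (14) for T = 2: h(a, b) + h(b, a)."""
    return h[a][b] + h[b][a]

def K(L):
    """The operator (45) on a finite support."""
    exponent = [sum(S_h(a, b) * p[b] * L[b] for b in range(n)) / alpha
                for a in range(n)]
    weights = [math.exp(e) for e in exponent]
    norm = sum(p[a] * weights[a] for a in range(n))   # E_0 of the numerator
    return [w / norm for w in weights]

L = [1.0] * n                  # start from the benchmark (no change of measure)
for _ in range(200):
    L = K(L)

residual = max(abs(a - b) for a, b in zip(K(L), L))
print(L, residual)
```

The normalization inside `K` enforces $E_0[L] = 1$ at every iterate, and the residual measures how closely the final iterate satisfies the fixed point relation (41).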



Our argument is mainly divided into three steps. First, we prove that $\mathcal{K}$ is a well-defined contraction mapping on $\mathcal{L}(M)$. This implies the existence and uniqueness of a fixed point in $\mathcal{L}(M)$. In the second step, we show that the objective value of the inner maximization (40) is, in a certain sense, monotone in the iterations of $\mathcal{K}$. The third step comprises proving that this objective value converges to the value evaluated at the fixed point, which will then conclude Proposition 2.

For convenience, in the subsequent calculation we let $C$ be a positive constant, not necessarily the same every time it appears.

Step 1: Contraction Mapping. We prove that $\mathcal{K}$ is a well-defined contraction map on $\mathcal{L}(M)$, equipped with the norm defined in (36). The existence and uniqueness of a fixed point for $\mathcal{K}$ is established by the Banach fixed point theorem (§1 in [1]).

Lemma 2. With Assumption 3, for sufficiently large $\alpha$, the operator $\mathcal{K} : \mathcal{L}(M) \to \mathcal{L}(M)$ is well-defined, closed, and a strict contraction in $\mathcal{L}(M)$ equipped with the norm $\|\cdot\|_\Lambda$ introduced in (36). Hence there exists a unique fixed point $L^* \in \mathcal{L}(M)$ that satisfies $\mathcal{K}(L^*) = L^*$.

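Proof. We show each element in the lemma one by one:

Well-definedness: Let $\beta = 1/\alpha$ for convenience. To prove that $\mathcal{K}$ is well-defined on $\mathcal{L}(M)$, we shall show that, for any $L \in \mathcal{L}(M)$, $0 < E_0[e^{\beta E_0[S_h(X_1, X_2)L(X_2) \mid X_1]}] < \infty$ for small enough $\beta$. Recall Assumption 3 and the notation $\Lambda(x) = \Lambda_1(x) + \Lambda_2(x)$ in (38). We have

$$\begin{aligned} E_0[e^{\beta E_0[S_h(X_1, X_2)L(X_2) \mid X_1]}] &\le E_0[e^{\beta E_0[(\Lambda(X_1) + \Lambda(X_2))L(X_2) \mid X_1]}] \\ &= E_0[e^{\beta(\Lambda(X_1) + E_0[\Lambda(X_2)L(X_2)])}] \\ &= E_0[e^{\beta\Lambda(X)}]e^{\beta E_0[\Lambda(X)L(X)]} \\ &\le E_0[e^{\beta\Lambda(X)}]e^{\beta M} \\ &< \infty \end{aligned}$$

for $L \in \mathcal{L}(M)$ and small enough $\beta$, where we use the independence of $X_1$ and $X_2$ in the first equality, the definition of $\mathcal{L}(M)$ in the second inequality and (38) in the last inequality. Similarly, we also have

$$E_0[e^{\beta E_0[S_h(X_1, X_2)L(X_2) \mid X_1]}] \ge E_0[e^{-\beta E_0[(\Lambda(X_1) + \Lambda(X_2))L(X_2) \mid X_1]}] \ge E_0[e^{-\beta\Lambda(X)}]e^{-\beta M} > 0 \tag{46}$$

for small enough $\beta$. Hence the denominator in the definition of $\mathcal{K}$ is finite and non-zero. As a consequence, we have also shown that the numerator of $\mathcal{K}$, given by $e^{\beta E_0[S_h(X_1, X_2)L(X_2) \mid X_1 = x]}$, is finite a.s. Therefore $\mathcal{K}$ is well-defined.

Closedness: Next we show that $\mathcal{K}$ is closed in $\mathcal{L}(M)$ for large enough $M$, or more precisely, when $M > E_0[\Lambda(X)]$. For any $L \in \mathcal{L}(M)$, we have, by a similar argument as above, that

$$E_0[\Lambda(X)\mathcal{K}(L)(X)] = \frac{E_0[\Lambda(X_1)e^{\beta E_0[S_h(X_1, X_2)L(X_2) \mid X_1]}]}{E_0[e^{\beta E_0[S_h(X_1, X_2)L(X_2) \mid X_1]}]} \le \frac{E_0[\Lambda(X)e^{\beta\Lambda(X)}]e^{2\beta E_0[\Lambda(X)L(X)]}}{E_0[e^{-\beta\Lambda(X)}]} \le \frac{E_0[\Lambda(X)e^{\beta\Lambda(X)}]e^{2\beta M}}{E_0[e^{-\beta\Lambda(X)}]}. \tag{47}$$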

As $\beta \to 0$, we have $E_0[\Lambda(X)e^{\beta\Lambda(X)}] \to E_0[\Lambda(X)]$ and $E_0[e^{-\beta\Lambda(X)}] \to 1$. Hence, by choosing $M > E_0[\Lambda(X)]$, (47) is bounded by $M$ when $\beta$ is small enough. The result then follows.

Contraction: To check that $\mathcal{K}$ is a contraction, write for brevity $u_L(X_1) := \beta E_0[S_h(X_1, X_2)L(X_2) \mid X_1]$ and consider, for any $L, L' \in \mathcal{L}(M)$,

$$\begin{aligned} \|\mathcal{K}(L) - \mathcal{K}(L')\|_\Lambda &= E_0[(1 + \Lambda(X))|\mathcal{K}(L)(X) - \mathcal{K}(L')(X)|] \\ &= E_0\left[(1 + \Lambda(X_1))\left|\frac{e^{u_L(X_1)}}{E_0[e^{u_L(X_1)}]} - \frac{e^{u_{L'}(X_1)}}{E_0[e^{u_{L'}(X_1)}]}\right|\right] \\ &= E_0\left[(1 + \Lambda(X_1))\left|\frac{1}{\xi_2}\left(e^{u_L(X_1)} - e^{u_{L'}(X_1)}\right) - \frac{\xi_1}{\xi_2^2}\left(E_0[e^{u_L(X_1)}] - E_0[e^{u_{L'}(X_1)}]\right)\right|\right] \end{aligned} \tag{48}$$

by using the mean value theorem, where $(\xi_1, \xi_2)$ lies in the line segment between $(e^{u_L(X_1)}, E_0[e^{u_L(X_1)}])$ and $(e^{u_{L'}(X_1)}, E_0[e^{u_{L'}(X_1)}])$. By (46), we have $\xi_2 > 1 - \epsilon$ for some small $\epsilon > 0$, when $\beta$ is small enough. Moreover, $\xi_1 \le e^{\beta(\Lambda(X_1) + E_0[\Lambda(X)L(X)])} \le e^{\beta\Lambda(X_1) + \beta M}$. Hence, we have that (48) is less than or equal to

$$\begin{aligned} & E_0\left[\sup\left|\frac{1}{\xi_2}\right|(1 + \Lambda(X_1))\left|e^{u_L(X_1)} - e^{u_{L'}(X_1)}\right|\right] + E_0\left[\sup\left|\frac{\xi_1}{\xi_2^2}\right|(1 + \Lambda(X_1))\right]\left|E_0[e^{u_L(X_1)}] - E_0[e^{u_{L'}(X_1)}]\right| \\ &\le C E_0\left[(1 + \Lambda(X_1))\left|e^{u_L(X_1)} - e^{u_{L'}(X_1)}\right|\right] + C E_0[(1 + \Lambda(X_1))e^{\beta\Lambda(X_1)}]\left|E_0[e^{u_L(X_1)}] - E_0[e^{u_{L'}(X_1)}]\right| \\ &\le C E_0\left[(1 + \Lambda(X_1))e^{\beta\Lambda(X_1)}\left|e^{u_L(X_1)} - e^{u_{L'}(X_1)}\right|\right] \\ &\le C\beta E_0\left[(1 + \Lambda(X_1))e^{2\beta\Lambda(X_1)}\left|E_0[S_h(X_1, X_2)L(X_2) \mid X_1] - E_0[S_h(X_1, X_2)L'(X_2) \mid X_1]\right|\right] \quad \text{by the mean value theorem again} \\ &\le C\beta E_0\left[(1 + \Lambda(X_1))e^{2\beta\Lambda(X_1)}|S_h(X_1, X_2)||L(X_2) - L'(X_2)|\right] \\ &\le C\beta E_0\left[(1 + \Lambda(X_1))e^{2\beta\Lambda(X_1)}(\Lambda(X_1) + \Lambda(X_2))|L(X_2) - L'(X_2)|\right] \\ &\le C\beta\left(E_0[(1 + \Lambda(X))^2 e^{2\beta\Lambda(X)}]E_0|L(X) - L'(X)| + E_0[(1 + \Lambda(X))e^{2\beta\Lambda(X)}]E_0[\Lambda(X)|L(X) - L'(X)|]\right) \\ &\le C\beta\|L - L'\|_\Lambda. \end{aligned}$$

Hence when $\beta$ is small enough, i.e. $\alpha$ is large enough, we have $C\beta < 1$, and $\mathcal{K}$ is a strict contraction. By the Banach fixed point theorem, there exists a unique fixed point $L^*$ that satisfies $\mathcal{K}(L^*) = L^*$. □

Step 2: Monotonicity of the Objective Value. We prove that the iteration associated with $\mathcal{K}$ is monotonic in the objective value in (40) in the following sense:

Lemma 3. For sufficiently large $\alpha$ and given any $L^{(1)}, L^{(2)} \in \mathcal{L}(M)$, construct the sequence $L^{(k+1)} = \mathcal{K}(L^{(k)})$ for $k \ge 2$ by iteratively applying $\mathcal{K}$. Then the value of

$$E_0[S_h(X_1, X_2)L^{(k)}(X_1)L^{(k+1)}(X_2)] - \alpha E_0[L^{(k)}\log L^{(k)}] - \alpha E_0[L^{(k+1)}\log L^{(k+1)}]$$

is non-decreasing in $k$, for $k \ge 1$.

Proof. By Proposition 1, for any $k \ge 2$, $L^{(k+1)}$ is a maximizer of (25) in Section 4, with cost function $E_0[S_h(X_1, X_2)L^{(k)}(X_2) \mid X_1 = x]$. Hence, for any $k \ge 1$,

$$\begin{aligned} & E_0[S_h(X_1, X_2)L^{(k)}(X_1)L^{(k+1)}(X_2)] - \alpha E_0[L^{(k)}\log L^{(k)}] - \alpha E_0[L^{(k+1)}\log L^{(k+1)}] \\ &= E_0[E_0[S_h(X_1, X_2)L^{(k+1)}(X_2) \mid X_1]L^{(k)}(X_1)] - \alpha E_0[L^{(k)}\log L^{(k)}] - \alpha E_0[L^{(k+1)}\log L^{(k+1)}] \\ &\le E_0[E_0[S_h(X_1, X_2)L^{(k+1)}(X_2) \mid X_1]L^{(k+2)}(X_1)] - \alpha E_0[L^{(k+2)}\log L^{(k+2)}] - \alpha E_0[L^{(k+1)}\log L^{(k+1)}] \\ &= E_0[S_h(X_1, X_2)L^{(k+1)}(X_2)L^{(k+2)}(X_1)] - \alpha E_0[L^{(k+2)}\log L^{(k+2)}] - \alpha E_0[L^{(k+1)}\log L^{(k+1)}] \\ &= E_0[S_h(X_1, X_2)L^{(k+1)}(X_1)L^{(k+2)}(X_2)] - \alpha E_0[L^{(k+1)}\log L^{(k+1)}] - \alpha E_0[L^{(k+2)}\log L^{(k+2)}] \end{aligned}$$

by using Proposition 1 in the inequality and the symmetry of $S_h$ in the last equality. This concludes the lemma. □
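The monotonicity in Lemma 3 can be observed numerically on a finite support. The sketch below is an illustration under assumed inputs (the probabilities `p`, cost matrix `h`, starting likelihood ratio `start` and `alpha = 20.0` are hypothetical): it builds the sequence with $L^{(1)} = L^{(2)}$, applies the operator (45) repeatedly, and records the mixed objective for consecutive pairs.

```python
import math

p = [0.2, 0.5, 0.3]            # hypothetical benchmark probabilities
h = [[0.0, 1.0, 2.0],          # hypothetical cost h(x1, x2)
     [0.5, 0.0, 1.0],
     [2.0, 1.0, 0.5]]
n = len(p)
alpha = 20.0                   # assumed dual variable

def S_h(a, b):
    """Swap symmetrization (14) for T = 2."""
    return h[a][b] + h[b][a]

def K(L):
    """The operator (45) on a finite support."""
    weights = [math.exp(sum(S_h(a, b) * p[b] * L[b] for b in range(n)) / alpha)
               for a in range(n)]
    norm = sum(p[a] * weights[a] for a in range(n))
    return [w / norm for w in weights]

def entropy(L):
    """E_0[L log L] for a strictly positive likelihood ratio."""
    return sum(p[a] * L[a] * math.log(L[a]) for a in range(n))

def mixed_objective(L1, L2):
    """E_0[S_h L1(X1) L2(X2)] - alpha E_0[L1 log L1] - alpha E_0[L2 log L2]."""
    cross = sum(S_h(a, b) * p[a] * p[b] * L1[a] * L2[b]
                for a in range(n) for b in range(n))
    return cross - alpha * entropy(L1) - alpha * entropy(L2)

start = [1.5, 0.8, 1.0]        # satisfies sum(p * start) = 1
seq = [start, start]           # L^(1) = L^(2)
for _ in range(30):
    seq.append(K(seq[-1]))     # L^(k+1) = K(L^(k)) for k >= 2

values = [mixed_objective(seq[k], seq[k + 1]) for k in range(len(seq) - 1)]
print(values[:5], values[-1])
```

The recorded values should never decrease along the sequence, and the last one approximates the right-hand side of the inequality used to conclude Proposition 2.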

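Step 3: Convergence of the Objective Value to the Optimum. We argue that the following convergence holds:

Lemma 4. When $\alpha$ is sufficiently large, for any $L^{(1)}, L^{(2)} \in \mathcal{L}(M)$, we have

$$E_0[S_h(X_1, X_2)L^{(k)}(X_1)L^{(k+1)}(X_2)] - \alpha E_0[L^{(k)}\log L^{(k)}] - \alpha E_0[L^{(k+1)}\log L^{(k+1)}] \to E_0[S_h(X_1, X_2)L^*(X_1)L^*(X_2)] - 2\alpha E_0[L^*\log L^*] \tag{49}$$

where $L^{(k)}$ is defined as in Lemma 3, and $L^*$ is the fixed point of $\mathcal{K}$.

Proof. Let $\beta = 1/\alpha$. First, recognize that the fixed point theorem implies convergence of $L^{(k)}$ to $L^*$, i.e. $\|L^{(k)} - L^*\|_\Lambda \to 0$ as $k \to \infty$.

We consider the convergence of different parts of (49) separately. For the first term, consider

$$\begin{aligned} & |E_0[S_h(X_1, X_2)L^{(k)}(X_1)L^{(k+1)}(X_2)] - E_0[S_h(X_1, X_2)L^*(X_1)L^*(X_2)]| \\ &\le E_0[|S_h(X_1, X_2)||L^{(k)}(X_1)L^{(k+1)}(X_2) - L^*(X_1)L^*(X_2)|] \\ &\le E_0[(\Lambda(X_1) + \Lambda(X_2))(L^{(k+1)}(X_2)|L^{(k)}(X_1) - L^*(X_1)| + L^*(X_1)|L^{(k+1)}(X_2) - L^*(X_2)|)] \\ &= E_0[\Lambda(X_1)|L^{(k)}(X_1) - L^*(X_1)|] + E_0[\Lambda(X_2)L^{(k+1)}(X_2)]E_0|L^{(k)}(X_1) - L^*(X_1)| \\ &\quad + E_0[\Lambda(X_1)L^*(X_1)]E_0|L^{(k+1)}(X_2) - L^*(X_2)| + E_0[\Lambda(X_2)|L^{(k+1)}(X_2) - L^*(X_2)|] \\ &\qquad \text{by the independence of } X_1 \text{ and } X_2 \text{ and the fact that } E_0[L^{(k)}] = E_0[L^{(k+1)}] = E_0[L^*] = 1 \\ &\le C\|L^{(k)} - L^*\|_\Lambda + C\|L^{(k+1)} - L^*\|_\Lambda \\ &\to 0 \end{aligned} \tag{50}$$

as $k \to \infty$.

We now consider the remaining terms in (49). For $k \ge 3$, by noting that $L^{(k)} = \mathcal{K}(L^{(k-1)})$, we have

$$\begin{aligned} & |E_0[L^{(k)}\log L^{(k)}] - E_0[L^*\log L^*]| \\ &= \left|\left(\beta E_0[S_h(X_1, X_2)L^{(k)}(X_1)L^{(k-1)}(X_2)] - \log E_0[e^{\beta E_0[S_h(X_1, X_2)L^{(k-1)}(X_2) \mid X_1]}]\right)\right. \\ &\quad \left. - \left(\beta E_0[S_h(X_1, X_2)L^*(X_1)L^*(X_2)] - \log E_0[e^{\beta E_0[S_h(X_1, X_2)L^*(X_2) \mid X_1]}]\right)\right| \\ &\le \beta|E_0[S_h(X_1, X_2)L^{(k)}(X_1)L^{(k-1)}(X_2)] - E_0[S_h(X_1, X_2)L^*(X_1)L^*(X_2)]| \\ &\quad + |\log E_0[e^{\beta E_0[S_h(X_1, X_2)L^{(k-1)}(X_2) \mid X_1]}] - \log E_0[e^{\beta E_0[S_h(X_1, X_2)L^*(X_2) \mid X_1]}]|. \end{aligned} \tag{51}$$

The first term in (51) converges to 0 by the same argument as (50). For the second term, we can write, by the mean value theorem, that

$$\log E_0[e^{\beta E_0[S_h(X_1, X_2)L^{(k-1)}(X_2) \mid X_1]}] - \log E_0[e^{\beta E_0[S_h(X_1, X_2)L^*(X_2) \mid X_1]}] = \frac{1}{\xi}\left(E_0[e^{\beta E_0[S_h(X_1, X_2)L^{(k-1)}(X_2) \mid X_1]}] - E_0[e^{\beta E_0[S_h(X_1, X_2)L^*(X_2) \mid X_1]}]\right) \tag{52}$$

where $\xi$ lies between $E_0[e^{\beta E_0[S_h(X_1, X_2)L^{(k-1)}(X_2) \mid X_1]}]$ and $E_0[e^{\beta E_0[S_h(X_1, X_2)L^*(X_2) \mid X_1]}]$, and hence $\xi \ge E_0[e^{-\beta\Lambda(X)}]e^{-\beta M} > 1 - \epsilon$ for some small $\epsilon$, when $\beta$ is small enough. Moreover,

$$\left|E_0[e^{\beta E_0[S_h(X_1, X_2)L^{(k-1)}(X_2) \mid X_1]}] - E_0[e^{\beta E_0[S_h(X_1, X_2)L^*(X_2) \mid X_1]}]\right| \le \beta E_0\left[e^{\beta\theta}\left|E_0[S_h(X_1, X_2)L^{(k-1)}(X_2) \mid X_1] - E_0[S_h(X_1, X_2)L^*(X_2) \mid X_1]\right|\right]$$

for some $\theta$ lying between $E_0[S_h(X_1, X_2)L^{(k-1)}(X_2) \mid X_1]$ and $E_0[S_h(X_1, X_2)L^*(X_2) \mid X_1]$, and hence $\theta \le \Lambda(X_1) + E_0[\Lambda(X)L(X)] \le \Lambda(X_1) + M$. So, much like the argument in proving Lemma 2, (52) is less than or equal to

$$C\beta\left(E_0[\Lambda(X)e^{\beta\Lambda(X)}]E_0|L^{(k-1)}(X) - L^*(X)| + E_0[e^{\beta\Lambda(X)}]E_0[\Lambda(X)|L^{(k-1)}(X) - L^*(X)|]\right) \le C\|L^{(k-1)} - L^*\|_\Lambda \to 0$$

as $k \to \infty$. We conclude the lemma. □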
Final Step. Proposition 2 is now a simple consequence of the above steps:

Proof of Proposition 2. Given any $L \in \mathcal{L}(M)$, fix $L^{(1)} = L^{(2)} = L$ and construct the recursion $\{L^{(k)}\}_{k \ge 1}$ as in Lemmas 3 and 4. Combining these two lemmas will conclude that

$$E_0[S_h(X_1, X_2)L(X_1)L(X_2)] - 2\alpha E_0[L \log L] \le E_0[S_h(X_1, X_2)L^*(X_1)L^*(X_2)] - 2\alpha E_0[L^*\log L^*]$$

and hence Proposition 2. □

5.3 Outline of Argument of Proposition 2 for General Time Horizon T

We discuss briefly how to generalize our argument in the previous subsection to a general time horizon
$T$. There are two natural generalizations of the operator $\mathcal{K}$ defined in the $T = 2$ case. One is a
straightforward definition in view of (41), mapping $\mathcal{L}(M) \to \mathcal{L}(M)$:

$$\mathcal{K}(L)(x) := \frac{e^{g^L(x)/\alpha}}{E_0[e^{g^L(X)/\alpha}]} = \frac{e^{E_0[S_h(\mathbf{X}_T)\underline{L}_{2:T}\mid X_1 = x]/\alpha}}{E_0[e^{E_0[S_h(\mathbf{X}_T)\underline{L}_{2:T}\mid X_1]/\alpha}]} \tag{53}$$

where $S_h$ is defined in (14), and $\underline{L}_{2:T} := \prod_{t=2}^T L(X_t)$.

Another generalization, which is more useful in the proof, is constructed as follows. Rather than
using $S_h$, we define a uniformly symmetric function derived from $h$ as

$$\bar{S}_h(\mathbf{x}) = \sum_{\mathbf{y} \in \mathcal{S}_T} h(\mathbf{y}) \tag{54}$$

where $\mathcal{S}_T$ is the symmetric group of all permutations of $\mathbf{x} = (x_1, \ldots, x_T)$. The summation in (54)
has $T!$ terms. By construction, the value of $\bar{S}_h$ is invariant to any permutation of its
arguments. In the case of $T = 2$, $\bar{S}_h$ is the same as $S_h$ defined in (14), but the difference is apparent
when $T > 2$.
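The construction in (54) can be illustrated with a minimal Python sketch. The cost function `h` and the helper name `symmetrize` below are hypothetical choices made only for this illustration; they do not come from the paper.

```python
from itertools import permutations

def symmetrize(h, T):
    """Return the map x -> sum of h over all T! orderings of x = (x_1, ..., x_T)."""
    def S_bar(x):
        assert len(x) == T
        return sum(h(p) for p in permutations(x))
    return S_bar

# Hypothetical cost for T = 3 whose value depends on the order of its arguments.
def h(x):
    return x[0] + 2 * x[1] * x[2]

S_bar = symmetrize(h, 3)
x = (1.0, 2.0, 3.0)

# Evaluate the symmetrized function at every reordering of x.
values = [S_bar(p) for p in permutations(x)]
```

Every entry of `values` coincides, which is the permutation invariance stated above, while `h` itself is not symmetric.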

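The function $\bar{S}_h$ is only an artifact in the proof and does not appear in either the solution or
practical evaluation. The main use of $\bar{S}_h$ is to construct an operator $\bar{\mathcal{K}} : \mathcal{L}(M)^{T-1} \to \mathcal{L}(M)^{T-1}$
as follows. Denote $\mathbf{L} = (L_1, \ldots, L_{T-1}) \in \mathcal{L}(M)^{T-1}$. We first define the component-wise mapping
$K : \mathcal{L}(M)^{T-1} \to \mathcal{L}(M)$ given by

$$K(L_1,\ldots,L_{T-1})(x) := \frac{\exp\bigl\{E_0[\bar{S}_h(X,X_1,\ldots,X_{T-1})\prod_{t=1}^{T-1}L_t(X_t)\mid X = x]/(\alpha(T-1)!)\bigr\}}{E_0\bigl[\exp\bigl\{E_0[\bar{S}_h(X,X_1,\ldots,X_{T-1})\prod_{t=1}^{T-1}L_t(X_t)\mid X]/(\alpha(T-1)!)\bigr\}\bigr]} \tag{55}$$

Then for a given $\mathbf{L}$, define

$$
\begin{aligned}
\tilde{L}_1 &= K(L_1,\ldots,L_{T-1}) \\
\tilde{L}_2 &= K(\tilde{L}_1,L_2,\ldots,L_{T-1}) \\
\tilde{L}_3 &= K(\tilde{L}_1,\tilde{L}_2,L_3,\ldots,L_{T-1}) \\
&\;\;\vdots \\
\tilde{L}_{T-1} &= K(\tilde{L}_1,\ldots,\tilde{L}_{T-2},L_{T-1})
\end{aligned}
\tag{56}
$$

The operator $\bar{\mathcal{K}}$ on $\mathcal{L}(M)^{T-1}$ is defined as

$$\bar{\mathcal{K}}(\mathbf{L}) = (\tilde{L}_1,\ldots,\tilde{L}_{T-1}) \tag{57}$$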

It is the second operator, defined in (57), that turns out to satisfy a monotone property in the objective
function, which is the key to proving the general case $T > 2$. It also holds that the components
of the fixed point of the operator in (57) are all identical, and they coincide with the fixed point of the operator presented in
(53), the latter defined on a much smaller space. The smaller space of the operator in (53) then becomes useful for a
truncation argument in random time horizon problems.

The full argument for the general time horizon $T > 2$ will be deferred to the appendix. Following
Section 5.2, below is a summary of the main steps leading to the proof. First, we need to show the
contraction property and characterize the fixed point of the operator in (57) with an appropriate metric:

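Lemma 5. With Assumption 3, when $\alpha$ is sufficiently large, the operator $\bar{\mathcal{K}} : \mathcal{L}(M)^{T-1} \to \mathcal{L}(M)^{T-1}$
defined in (57) is well-defined, closed and a contraction, using the metric $d(\cdot,\cdot) : \mathcal{L}(M)^{T-1} \times \mathcal{L}(M)^{T-1} \to \mathbb{R}_+$ defined as

$$d(\mathbf{L},\mathbf{L}') = \max_{t=1,\ldots,T-1} E_0\bigl[(1 + \Lambda(X))\,|L_t(X) - L_t'(X)|\bigr]$$

where $\mathbf{L} = (L_1, \ldots, L_{T-1}) \in \mathcal{L}(M)^{T-1}$ and $\mathbf{L}' = (L_1', \ldots, L_{T-1}') \in \mathcal{L}(M)^{T-1}$. Hence there exists a
unique fixed point $\mathbf{L}^*$ that satisfies $\bar{\mathcal{K}}(\mathbf{L}^*) = \mathbf{L}^*$. Moreover, all components of $\mathbf{L}^*$ are identical.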

The property that the fixed point $\mathbf{L}^*$ has identical components will conclude a convergence
result for the component-wise iteration defined in (55):

Corollary 1. With Assumption 3 and sufficiently large $\alpha$, starting from any $L^{(1)}, \ldots, L^{(T-1)} \in \mathcal{L}(M)$,
the iteration $L^{(k)} = K(L^{(k-T+1)}, \ldots, L^{(k-1)})$, where $K : \mathcal{L}(M)^{T-1} \to \mathcal{L}(M)$ is as defined in (55),
for $k \ge T$, converges to $L^*$ in $\|\cdot\|_\Lambda$-norm, where $L^*$ is the identical component of the fixed point
$\mathbf{L}^*$ of the operator in (57).
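To make the fixed-point iteration concrete, here is a minimal numerical sketch for the case $T = 2$, where the map reduces to (53). The three-point baseline distribution, the cost `h`, and the value of `alpha` are hypothetical choices for illustration only; `alpha` is simply taken large enough for the iteration to contract on this toy example.

```python
import math

# Hypothetical baseline P0 on a three-point support, and a bounded cost h.
support = [0.0, 1.0, 2.0]
p0 = [0.5, 0.3, 0.2]

def h(x, y):
    return x * y + 0.5 * x          # not symmetric in (x, y)

def S_h(x, y):
    return h(x, y) + h(y, x)        # symmetrization for T = 2

def K(L, alpha):
    """One application of the map in (53) for T = 2 on the discrete support."""
    # Exponent: E0[S_h(x, X2) L(X2)] / alpha for each point x of the support.
    expo = [sum(p * S_h(x, y) * l for y, p, l in zip(support, p0, L)) / alpha
            for x in support]
    w = [math.exp(e) for e in expo]
    norm = sum(p * wi for p, wi in zip(p0, w))   # normalizer E0[exp(...)]
    return [wi / norm for wi in w]

def fixed_point(alpha, iters=200):
    L = [1.0] * len(support)        # start from the baseline (L identically 1)
    for _ in range(iters):
        L = K(L, alpha)
    return L

alpha = 5.0
L_star = fixed_point(alpha)
# Distance between L_star and its image under K, and total mass under P0.
residual = max(abs(a - b) for a, b in zip(K(L_star, alpha), L_star))
mass = sum(p * l for p, l in zip(p0, L_star))
```

The quantity `residual` measures how far the iterate is from satisfying the fixed-point equation, and `mass` checks that the limit is a valid likelihood ratio under the baseline.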



We shall consider the objective in (40) multiplied by $T!$, i.e.

$$T!\,\bigl(E_0[h(\mathbf{X}_T)\underline{L}_T] - \alpha E_0[L\log L]\bigr) = E_0[\bar{S}_h(\mathbf{X}_T)\underline{L}_T] - \alpha T!\,E_0[L\log L]. \tag{58}$$

The component-wise mapping $K$ possesses a monotonicity property on this scaled objective:
Lemma 6. For sufficiently large $\alpha$, starting from any $L^{(1)}, \ldots, L^{(T)} \in \mathcal{L}(M)$, construct the sequence
$L^{(k+1)} = K(L^{(k-T+2)}, \ldots, L^{(k)})$ via the component-wise iteration $K$, for $k \ge T$. Then

$$E_0\Bigl[\bar{S}_h(\mathbf{X}_T)\prod_{t=1}^T L^{(k+t-1)}(X_t)\Bigr] - \alpha(T-1)!\sum_{t=1}^T E_0[L^{(k+t-1)}\log L^{(k+t-1)}] \tag{59}$$

is non-decreasing in $k$, for $k \ge 1$.



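Finally, we show the convergence of (59) to the scaled objective in (58) evaluated at any component
of the fixed point of the operator in (57):

Lemma 7. For sufficiently large $\alpha$, starting from any $L^{(1)}, \ldots, L^{(T)} \in \mathcal{L}(M)$, we have

$$E_0\Bigl[\bar{S}_h(\mathbf{X}_T)\prod_{t=1}^T L^{(k+t-1)}(X_t)\Bigr] - \alpha(T-1)!\sum_{t=1}^T E_0[L^{(k+t-1)}\log L^{(k+t-1)}] \;\to\; E_0\Bigl[\bar{S}_h(\mathbf{X}_T)\prod_{t=1}^T L^*(X_t)\Bigr] - \alpha T!\,E_0[L^*\log L^*]$$

as $k \to \infty$, where $L^{(k)}$ is defined by the same recursion as in Lemma 6 and $L^*$ is any (identical) component of
$\mathbf{L}^* \in \mathcal{L}(M)^{T-1}$, the fixed point of the operator defined in (57).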



As in Section 5.2, given any $L \in \mathcal{L}(M)$, Lemmas 6 and 7 will conclude that

$$E_0\Bigl[\bar{S}_h(\mathbf{X}_T)\prod_{t=1}^T L(X_t)\Bigr] - \alpha(T-1)!\sum_{t=1}^T E_0[L\log L] \le E_0\Bigl[\bar{S}_h(\mathbf{X}_T)\prod_{t=1}^T L^*(X_t)\Bigr] - \alpha T!\,E_0[L^*\log L^*]$$

by defining $L^{(1)} = \cdots = L^{(T)} = L$ and using the recursion defined in the lemmas. Here $L^*$ is the identical
component of the fixed point of the operator in (57). Moreover, the fixed-point equation for (57) entails $L^* = K(L^*, \ldots, L^*)$
for $K$ defined in (55), which shows that $L^*$ satisfies (41). This concludes Proposition 2.



5.4 Duality Theorem and Asymptotic Expansion 

Since the primal objective for the case $T > 1$ is not necessarily concave, it is not immediate that
the dual problem (39) provides the same optimal value as the primal problem (37). The following
theorem, from Theorem 1 in Chapter 8, p. 220 in [39], gives a sufficient characterization over a general
vector space that avoids convexity. We state it in the notation of this paper:

Proposition 3 (a.k.a. Chapter 8, Theorem 1 in [39]). Suppose Assumptions 3 and 4 hold, and suppose
there is an $\alpha^*$, with $\alpha^* \ge 0$, and an $L^* \in \mathcal{L}(M)$ such that



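$$E_0[h(\mathbf{X}_T)\underline{L}_T^*] - \alpha^* E_0[L^*\log L^*] \ge E_0[h(\mathbf{X}_T)\underline{L}_T] - \alpha^* E_0[L\log L] \tag{60}$$

for all $L \in \mathcal{L}(M)$. Then $L^*$ solves

$$
\begin{array}{ll}
\max & E_0[h(\mathbf{X}_T)\underline{L}_T] \\
\text{subject to} & E_0[L\log L] \le E_0[L^*\log L^*] \\
& L \in \mathcal{L}(M).
\end{array}
$$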
For any sufficiently large $\alpha^* > 0$, we have already obtained an $L^*$ (parametrized by $\alpha^*$) that
satisfies (60), by Proposition 2. If we can find, for each $\eta$ in a neighborhood of 0, a large enough $\alpha^*$
such that $E_0[L^*\log L^*] = \eta$, then this $L^*$ will be an optimal solution to the optimization problem
(37) in view of Proposition 3. The condition $\eta = E_0[L^*\log L^*]$ is the complementary slackness
condition used in the proof of Theorem 1. In fact, similar to that proof, to find an $\alpha^*$ that satisfies
$E_0[L^*\log L^*] = \eta$, we carry out an asymptotic expansion of $E_0[L^*\log L^*]$ in terms of $\alpha^*$, and invert the
expansion to express $\alpha^*$ in terms of $\eta$. This will verify the optimality of $L^*$ for (37) by Proposition
3. After that, another expansion of the objective function $E_0[h(\mathbf{X}_T)\underline{L}_T^*]$ in terms of $\alpha^*$, and hence
$\eta$, will then give rise to the nonparametric derivatives.

The details of the aforementioned asymptotic expansions that arrive at Theorem 2 will be
left to the appendix. Below we sketch the essential steps. The starting point is a quadratic
approximation of $L^*$:

$$L^*(x) = 1 + \frac{1}{\alpha^*}\bigl(g(x) - E_0[g(X)]\bigr) + \frac{1}{\alpha^{*2}}V(x) + \cdots$$

where we define $V(x) := W(x) - E_0[W(X)] + \frac{1}{2}\bigl((g(x) - E_0[g(X)])^2 - E_0[(g(X) - E_0[g(X)])^2]\bigr)$ and
$W(x) := \sum_{t=1}^T \sum_{1 \le r \le T,\, r \ne t} E_0[h(\mathbf{X}_T)(g(X_r) - E_0[g(X)]) \mid X_t = x]$. With this approximation, we will
arrive at the following expansions:

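Relation between $\alpha^*$ and $\eta$: We can expand

$$\eta = E_0[L^*\log L^*] = \frac{1}{2\alpha^{*2}}\mathrm{Var}_0(g(X)) + \frac{1}{\alpha^{*3}}\Bigl(E_0[g(X)V(X)] - \frac{1}{6}\kappa_3(g(X))\Bigr) + O\Bigl(\frac{1}{\alpha^{*4}}\Bigr). \tag{61}$$

Under Assumption 4, we can invert (61) to get

$$\frac{1}{\alpha^*} = \sqrt{\frac{2\eta}{\mathrm{Var}_0(g(X))}} - \frac{2\eta\bigl(E_0[g(X)V(X)] - (1/6)\kappa_3(g(X))\bigr)}{(\mathrm{Var}_0(g(X)))^2} + O(\eta^{3/2}). \tag{62}$$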



This verifies the condition in Proposition 3 and concludes that $L^*$, for some large enough $\alpha^* > 0$,
is an optimal solution to the program (37) when $\eta$ is small enough.
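As a sanity check of (61) and (62), the following minimal sketch evaluates both expansions numerically in the simplest setting $T = 1$, where $g = h$, $W \equiv 0$, and $L^*(x) \propto e^{g(x)/\alpha^*}$ is an explicit exponential tilt. The three-point distribution, the values of `g`, and the value of `alpha_star` are hypothetical and serve only as an illustration.

```python
import math

# Hypothetical three-point baseline P0 and values of g (for T = 1, g = h).
g_vals = [0.0, 1.0, 3.0]
p0 = [0.5, 0.3, 0.2]

def E0(values):
    """Expectation under the baseline P0 of a function given by its values."""
    return sum(p * v for p, v in zip(p0, values))

mean_g = E0(g_vals)
centered = [g - mean_g for g in g_vals]
var_g = E0([a ** 2 for a in centered])
kappa3 = E0([a ** 3 for a in centered])          # third cumulant = third central moment
V = [0.5 * (a ** 2 - var_g) for a in centered]   # V(x) with W = 0, as in the T = 1 case
c3 = E0([g * v for g, v in zip(g_vals, V)]) - kappa3 / 6.0

def eta_exact(alpha):
    """Exact E0[L log L] for the exponential tilt L = exp(g/alpha) / E0[exp(g/alpha)]."""
    w = [math.exp(g / alpha) for g in g_vals]
    norm = E0(w)
    L = [wi / norm for wi in w]
    return E0([l * math.log(l) for l in L])

alpha_star = 20.0
eta = eta_exact(alpha_star)
# Two-term expansion (61) of eta in powers of 1/alpha_star.
eta_series = var_g / (2 * alpha_star ** 2) + c3 / alpha_star ** 3
# Inverted expansion (62): approximate 1/alpha_star from eta.
inv_alpha_series = math.sqrt(2 * eta / var_g) - 2 * eta * c3 / var_g ** 2
```

In this sketch `eta_series` should agree with `eta` up to a fourth-order term in `1/alpha_star`, and `inv_alpha_series` should recover `1/alpha_star` up to an error of order `eta ** 1.5`.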



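Relation between the objective value and $\alpha^*$: The primal optimal objective value in (37) can
be expanded as

$$E_0[h(\mathbf{X}_T)\underline{L}_T^*] = E_0[h(\mathbf{X}_T)] + \frac{1}{\alpha^*}\mathrm{Var}_0(g(X)) + \frac{1}{\alpha^{*2}}\Bigl[\frac{\nu}{2} + E_0[g(X)V(X)]\Bigr] + O\Bigl(\frac{1}{\alpha^{*3}}\Bigr)$$

which, combined with (62), becomes

$$E_0[h(\mathbf{X}_T)] + \sqrt{2\mathrm{Var}_0(g(X))\,\eta} + \frac{\eta}{\mathrm{Var}_0(g(X))}\Bigl(\frac{1}{3}\kappa_3(g(X)) + \nu\Bigr) + O(\eta^{3/2}).$$

This will conclude Theorem 2.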
5.5 Extension to Random Time Horizon Problems

In this subsection we discuss the extension to random time horizon problems, under Assumption 5.
This will be indicative of the argument under the alternate Assumption 6, which requires
more technicality and will be left to the appendix. First, the formulation (17) can be written in
terms of the likelihood ratio:

$$
\begin{array}{ll}
\max & E_0[h(\mathbf{X}_\tau)\underline{L}_\tau] \\
\text{subject to} & E_0[L(X)\log L(X)] \le \eta \\
& L \in \mathcal{L}
\end{array}
\tag{63}
$$

where $\underline{L}_\tau = \prod_{t=1}^{\tau} L(X_t)$ comes from the martingale property of sequential likelihood ratios. We
shall explain how to leverage our result in the last subsection to obtain the expressions in Theorem
3. Under Assumption 5, $\tau \le T$ a.s., for some $T > 0$. Hence the objective in (63) can also be written
as $E_0[h(\mathbf{X}_\tau)\underline{L}_\tau] = E_0[h(\mathbf{X}_\tau)\underline{L}_T]$, again by the martingale property of $\underline{L}_t$.

This immediately falls back into the framework of Theorem 2, with the cost function now being
$h(\mathbf{X}_\tau)$. For this particular cost function, we analyze $g(x)$ and $G(x, y)$ in Theorem 2 and argue that
they are indeed in the form stated in Theorem 3. We can write

$$g(x) = \sum_{t=1}^T E_0[h(\mathbf{X}_\tau) \mid X_t = x] = \sum_{t=1}^T E_0[h(\mathbf{X}_\tau);\, \tau \ge t \mid X_t = x] + \sum_{t=1}^T E_0[h(\mathbf{X}_\tau);\, \tau < t \mid X_t = x] \tag{64}$$

Consider the second summation in (64). Since $h(\mathbf{X}_\tau)I(\tau < t)$ is $\mathcal{F}_{t-1}$-measurable, it is independent
of $X_t$. As a result, the second summation in (64) is constant. Similarly, we can write

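$$
\begin{aligned}
G(x,y) &= \sum_{t=1}^T \sum_{\substack{s=1,\ldots,T \\ s \ne t}} E_0[h(\mathbf{X}_\tau) \mid X_t = x, X_s = y] \\
&= \sum_{t=1}^T \sum_{\substack{s=1,\ldots,T \\ s \ne t}} E_0[h(\mathbf{X}_\tau);\, \tau \ge t \wedge s \mid X_t = x, X_s = y] + \sum_{t=1}^T \sum_{\substack{s=1,\ldots,T \\ s \ne t}} E_0[h(\mathbf{X}_\tau);\, \tau < t \wedge s \mid X_t = x, X_s = y]
\end{aligned}
\tag{65}
$$

and the second summation in (65) is again a constant. From discussion point [s] in Section 3.2,
since the first and second order nonparametric derivatives are translation invariant to $g(x)$ and $G(x,y)$,
Theorem 3 follows immediately.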

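Along the same line of observations, we can also characterize the optimal change of measure
to the Lagrangian relaxation of (63), analogous to Proposition 2. For convenience, denote $\underline{L}_s^t =
\prod_{1 \le r \le s,\, r \ne t} L(X_r)$, i.e. the product of likelihood ratios, but with the $t$-th step identical to 1. With a
cost function $h(\mathbf{X}_\tau)$, the function $g^L(x)$ in Proposition 2 is now calculated to be

$$
\begin{aligned}
g^L(x) = \sum_{t=1}^T g_t^L(x) &= \sum_{t=1}^T E_0[h(\mathbf{X}_\tau)\underline{L}_T^t \mid X_t = x] \\
&= \sum_{t=1}^T E_0[h(\mathbf{X}_\tau)\underline{L}_T^t;\, \tau \ge t \mid X_t = x] + \sum_{t=1}^T E_0[h(\mathbf{X}_\tau)\underline{L}_T^t;\, \tau < t \mid X_t = x].
\end{aligned}
\tag{66}
$$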
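For the first summation in (66), by conditioning on $\mathcal{F}_\tau$ (or by breaking down $I(\tau \ge t)$ into a
summation over time steps $\tau \ge t$, followed by conditioning each on $\mathcal{F}_\tau$), we have

$$\sum_{t=1}^T E_0[h(\mathbf{X}_\tau)\underline{L}_T^t;\, \tau \ge t \mid X_t] = \sum_{t=1}^T E_0[h(\mathbf{X}_\tau)\underline{L}_\tau^t;\, \tau \ge t \mid X_t].$$

The main observation is that the second summation in (66) is independent of $X_t$. This is because

$$E_0[h(\mathbf{X}_\tau)\underline{L}_T^t;\, \tau < t \mid X_t] = E_0[h(\mathbf{X}_\tau)E_0[\underline{L}_T^t \mid \mathcal{F}_t];\, \tau < t \mid X_t] = E_0[h(\mathbf{X}_\tau)\underline{L}_{t-1};\, \tau < t \mid X_t]$$

and $h(\mathbf{X}_\tau)\underline{L}_{t-1}I(\tau < t)$ is $\mathcal{F}_{t-1}$-measurable, hence independent of $X_t$. The second summation of
(66) is therefore a constant equal to $\sum_{t=1}^T E_0[h(\mathbf{X}_\tau)\underline{L}_{t-1};\, \tau < t]$. The fixed point equation that
guides the optimal change of measure is

$$
\begin{aligned}
L(x) \propto e^{g^L(x)/\alpha} &= \exp\Bigl\{\frac{1}{\alpha}\sum_{t=1}^T\bigl(E_0[h(\mathbf{X}_\tau)\underline{L}_\tau^t;\, \tau \ge t \mid X_t = x] + E_0[h(\mathbf{X}_\tau)\underline{L}_{t-1};\, \tau < t]\bigr)\Bigr\} \\
&\propto \exp\Bigl\{\frac{1}{\alpha}\sum_{t=1}^T E_0[h(\mathbf{X}_\tau)\underline{L}_\tau^t;\, \tau \ge t \mid X_t = x]\Bigr\}.
\end{aligned}
$$

Comparing to Proposition 2, then, $g^L(x)$ is replaced by $\sum_{t=1}^T E_0[h(\mathbf{X}_\tau)\underline{L}_\tau^t;\, \tau \ge t \mid X_t = x]$ in the
optimal change of measure in the context of a bounded stopping time.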

The case when Assumption 6 holds instead of Assumption 5 is more involved. The main idea
is to consider a sequence of truncated random times $\tau \wedge T$, $T = 1, 2, \ldots$. We then set up a sequence
of contraction maps $\mathcal{K}_T(\cdot)$ associated with the cost function $h(\mathbf{X}_{\tau \wedge T})$, and show that this
sequence of maps converges to a limiting map $\mathcal{K}(\cdot)$, yielding the stated result. Details will be
provided in the appendix.

6 Bounds on Parametric Derivatives 

Let us discuss how the minimization problem (2) is addressed by an easy adaptation and how the
two problems (1) and (2) together give the range for all eligible parametric derivatives.

6.1 Minimization Formulation

All the results in Sections 3, 4 and 5 can be easily extended to the minimization formulation in
(2). One merely has to replace $h$ by $-h$ in all the theorems. This results in a change of sign while
keeping the same magnitude for the first order derivative, and keeping both the same sign and
magnitude for the second order derivative, for all of Theorems 1, 2 and 3. For example, the expansion
in Theorem 2 becomes

$$E_{f^*}[h(\mathbf{X}_T)] = E_0[h(\mathbf{X}_T)] - \sqrt{2\mathrm{Var}_0(g(X))}\,\eta^{1/2} + \frac{\eta}{\mathrm{Var}_0(g(X))}\Bigl(\frac{1}{3}\kappa_3(g(X)) + \nu\Bigr) + O(\eta^{3/2})$$

where $g(X)$ and $\nu$ are defined as before. In Theorem 1, moreover, the $\beta^*$ in ([s]) is chosen as the
unique negative instead of positive solution in the minimization formulation.

In terms of methodology, supposing one insists on using $h$ instead of $-h$ and carries out the
minimization program directly, the Lagrangian relaxations of all the formulations in Sections 4 and
5 will become max-min problems instead of min-max problems. This will then lead to optimal
changes of measure that involve $\alpha < 0$, where $\alpha$ is the negative dual variable. Such analysis will
yield the same conclusions as above.
6.2 Parametric Dominance

Because of the robust formulations (1) and (2), the proposed nonparametric derivatives dominate
any parametric derivatives in the following sense:

Proposition 4. Suppose $P_0$ lies in a parametric family $P^\theta$ with $\theta \in \Theta \subset \mathbb{R}$, say $P_0 = P^{\theta_0}$ where
$\theta_0 \in \Theta^\circ$. Denote $E^\theta[\cdot]$ as the expectation under $P^\theta$. Assume that

1. $P^\theta$ is absolutely continuous with respect to $P^{\theta_0}$ for $\theta$ in a neighborhood of $\theta_0$.

2. $D(\theta,\theta_0) := D(P^\theta \| P^{\theta_0}) \to 0$ as $\theta \to \theta_0$.

3. For any $\eta$ in a neighborhood of 0, $D(\theta, \theta_0) = \eta$ has two solutions $\theta^+(\eta) > \theta_0$ and $\theta^-(\eta) < \theta_0$;
moreover,



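$$\frac{d}{d\theta}\sqrt{D(\theta,\theta_0)}\,\Big|_{\theta=\theta_0^+} > 0 \quad\text{and}\quad \frac{d}{d\theta}\sqrt{D(\theta,\theta_0)}\,\Big|_{\theta=\theta_0^-} < 0.$$

4. $\frac{d}{d\theta}E^\theta[h(\mathbf{X})]\big|_{\theta=\theta_0}$ exists.

Then

$$\left|\,\frac{d}{d\theta}E^\theta[h(\mathbf{X})]\Big|_{\theta=\theta_0} \Big/ \frac{d}{d\theta}\sqrt{D(\theta,\theta_0)}\,\Big|_{\theta=\theta_0^\pm}\right| \le \sqrt{2\mathrm{Var}_0(\zeta(X))}$$

where $\zeta$ is the function $h$, $g$ or $g$ in Theorems 1, 2 and 3 respectively, depending on the structure
of $\mathbf{X}$ that is stated in each theorem under the corresponding assumptions.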

This proposition is an intuitive testament that our nonparametric derivative, as a result of the 
optimization formulation, dominates any parametric derivative that moves along the admissible 
directions allowed in the model space in consideration. The proof is merely a simple application of 
the first principle of differentiation. 
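The dominance can be illustrated numerically with a minimal sketch in the simplest setting of a single input variable. The assumptions here are hypothetical choices for illustration: the parametric family is exponential with rate `theta`, the baseline rate is `theta0 = 1`, and two costs are tried, `h(x) = x` and `h(x) = x**2`, whose means and baseline variances are known in closed form. The sketch compares the ratio of the change in mean to the square root of the KL divergence against $\sqrt{2\mathrm{Var}_0(h(X))}$.

```python
import math

def kl_exp(theta, theta0):
    """KL divergence D(Exp(rate theta) || Exp(rate theta0)), closed form."""
    r = theta / theta0
    return math.log(r) + 1.0 / r - 1.0

def ratio(mean_h, theta, theta0):
    """|E^theta[h] - E^theta0[h]| / sqrt(D(theta, theta0))."""
    return abs(mean_h(theta) - mean_h(theta0)) / math.sqrt(kl_exp(theta, theta0))

theta0 = 1.0
theta = 1.001                       # a small parametric perturbation of the rate

# h(x) = x: E[X] = 1/theta and Var_0(X) = 1/theta0**2.
ratio_x = ratio(lambda t: 1.0 / t, theta, theta0)
bound_x = math.sqrt(2.0) / theta0

# h(x) = x**2: E[X^2] = 2/theta**2 and Var_0(X^2) = 20/theta0**4.
ratio_x2 = ratio(lambda t: 2.0 / t ** 2, theta, theta0)
bound_x2 = math.sqrt(2.0 * 20.0) / theta0 ** 2
```

For `h(x) = x` the ratio is essentially equal to the nonparametric bound, since perturbing the rate of an exponential distribution tilts the density exactly along `x`. For `h(x) = x**2` the parametric ratio stays strictly below the bound.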

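Proof. Consider the general formulations (1) and (2), with $P_0 = P^{\theta_0}$. Denote $E_{f^+(\eta)}[h(\mathbf{X})]$ as the
optimal value of (1) and $E_{f^-(\eta)}[h(\mathbf{X})]$ as the optimal value of (2), when $\eta$ is in a neighborhood of
0. Under our assumptions, $P^{\theta^\pm(\eta)}$ with $D(\theta^\pm(\eta),\theta_0) = \eta$ are feasible solutions to both programs,
and hence the quantity $E^{\theta^\pm(\eta)}[h(\mathbf{X})]$ satisfies $E^{\theta^\pm(\eta)}[h(\mathbf{X})] \le E_{f^+(\eta)}[h(\mathbf{X})]$ and $E^{\theta^\pm(\eta)}[h(\mathbf{X})] \ge
E_{f^-(\eta)}[h(\mathbf{X})]$. This implies

$$\frac{E_{f^-(\eta)}[h(\mathbf{X})] - E_0[h(\mathbf{X})]}{\sqrt{\eta}} \le \frac{E^{\theta^\pm(\eta)}[h(\mathbf{X})] - E_0[h(\mathbf{X})]}{\sqrt{\eta}} \le \frac{E_{f^+(\eta)}[h(\mathbf{X})] - E_0[h(\mathbf{X})]}{\sqrt{\eta}}.$$

Taking the limit as $\eta \to 0$, the upper and lower bounds converge to our nonparametric derivatives
under the corresponding assumptions in each of Theorems 1, 2 and 3. The quantities

$$\frac{d}{d\sqrt{\eta}}E^{\theta^\pm(\eta)}[h(\mathbf{X})]\,\Big|_{\sqrt{\eta}=0} = \lim_{\eta \to 0}\frac{E^{\theta^\pm(\eta)}[h(\mathbf{X})] - E_0[h(\mathbf{X})]}{\sqrt{\eta}}$$

become $\frac{d}{d\theta}E^\theta[h(\mathbf{X})]\big|_{\theta=\theta_0} \big/ \frac{d}{d\theta}\sqrt{D(\theta,\theta_0)}\big|_{\theta=\theta_0^+}$
and $\frac{d}{d\theta}E^\theta[h(\mathbf{X})]\big|_{\theta=\theta_0} \big/ \frac{d}{d\theta}\sqrt{D(\theta,\theta_0)}\big|_{\theta=\theta_0^-}$
respectively, by using the chain rule and the implicit function theorem. This concludes the proposition. □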
7 Numerical Implementation

In this section we demonstrate how our results, in particular Theorem 2, can be applied in practice.
Let us use an example of many-server queues. Consider a FIFO queue with $s$ servers, where
customers arrive according to a renewal process and have i.i.d. service times. Whenever the
service capacity is full, newly arriving customers have to wait (for example, in an infinite-capacity
waiting room). This generic system (and its variations) can model the dynamics of call centers,
communication networks and other operational systems. In practice, the number of servers $s$ can be
large, and exact analysis is not available in general.

One common approach is to impose Markovian assumption, yet such an assumption, especially 
for the service time distribution, is often questionable (§111. Ig in p]). In those situations, it is useful 
to assess the effect on important performance measures, such as workloads and customer waiting 
times, when the exponential assumption is violated. For example, let us consider the mean waiting 
time for, say, the 100-th customer in an M/M/s queue with arrival rate 1 and service rate 1/s 
(i.e. the system is critically loaded), starting from an empty system. The solid line in Figure 1
plots the sample average of this waiting time using 10,000 samples, for different server capacities
s = 20, 30, 40, 50, 60. Also, the second and the third columns in Table 1 show these sample averages
and the corresponding 95% confidence intervals.

To quantify the sensitivity of the exponential assumption for the service times, we use Theorem
2 to compute the best and the worst-case first order approximations under different levels of input
model deviation $\eta$ in terms of KL divergence. Namely, we compute the quantity $\sqrt{2\mathrm{Var}_0(g(X))}$
where $P_0$ is Exp(1/s) and $g(\cdot)$ is the symmetrization function that is computed by sequentially
conditioning on the service time of customers 1 through 100, as defined in ([s]). Then we depict
the approximation $E_0[h(\mathbf{X}_T)] \pm \sqrt{2\mathrm{Var}_0(g(X))\,\eta}$ for different levels of $\eta$, shown by the pairs of
dotted lines in Figure 1. To get a sense of scale, $\eta = 0.005$ is equivalent to around 10% discrepancy in
service rate if the model is known to lie in the family of exponential distributions; this can be seen
by expressing the KL divergence in terms of the service rate to see that roughly KL divergence $\approx$
(% discrepancy in service rate)$^2$/2.
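The rule of thumb above can be checked with a minimal sketch that uses the closed-form KL divergence between two exponential distributions. The function name `kl_exponential` and the particular 10% perturbations are illustrative assumptions; only the ratio of the two rates matters, so the baseline rate is set to 1.

```python
import math

def kl_exponential(rate_new, rate_base):
    """KL divergence D(Exp(rate_new) || Exp(rate_base)) in closed form."""
    r = rate_new / rate_base
    return math.log(r) + 1.0 / r - 1.0

base = 1.0
kl_up = kl_exponential(1.1 * base, base)     # service rate 10% higher
kl_down = kl_exponential(0.9 * base, base)   # service rate 10% lower
rule_of_thumb = 0.1 ** 2 / 2                 # (% discrepancy)^2 / 2
```

Both `kl_up` and `kl_down` land close to `rule_of_thumb`, which equals 0.005, consistent with the statement that $\eta = 0.005$ corresponds to roughly a 10% discrepancy in service rate.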

We also tabulate our point estimates of the nonparametric derivatives $\sqrt{2\mathrm{Var}_0(g(X))}$ and their
confidence intervals in the fourth and fifth columns in Table 1. Moreover, for each $s$, we calculate
the ratio between the nonparametric derivative estimate and the benchmark performance measure
as an indicator of the relative impact of model misspecification:

$$\text{Relative model misspecification impact} := \frac{\text{Magnitude of nonparametric derivative}}{\text{Performance measure}}$$



Number of    Benchmark performance measure    Nonparametric derivative      Relative
servers      Mean      C.I.                   Mean       C.I.               impact
20           5.023     (4.905, 5.153)         47.840     (43.106, 52.574)    9.513
30           3.039     (2.944, 3.134)         35.437     (32.256, 38.618)   11.661
40           1.510     (1.445, 1.575)         21.504     (19.663, 23.345)   14.239
50           0.498     (0.463, 0.533)          8.255     (7.338, 9.171)     16.575
60           0.079     (0.067, 0.091)          1.802     (1.490, 2.114)     22.702

Table 1: Simulation results for performance measures and first order nonparametric derivatives for
the mean waiting time of the 100-th customer in M/M/s systems with different server capacities


Figure 1: First order approximation of the deviation from benchmark performance measure under
different input model discrepancies in terms of KL divergence

As seen from Figure 1 and Table 1, the mean waiting time for the 100-th customer decreases
from 5.02 to 0.08, as the number of servers increases from 20 to 60. This is because the effect of
critical loading reduces as more servers are available, given that the system starts from emptiness.
The magnitude of the nonparametric derivative also decreases correspondingly, from 47.8 when
s = 20 to 1.8 when s = 60. However, the relative effect of model misspecification actually increases,
from 9.51 to 22.70. This means that the larger the system capacity, the more sensitive the
mean waiting time is to a violation of the exponential service time assumption.

The same method can be easily adapted to test other types of performance measures and models.
For example, Figure 2 and Table 2 carry out the same assessment scheme for the service time of a
non-Markovian G/G/s queue with gamma arrivals and uniform service times. Here we consider a
deviation from the uniform distribution for the service time. In this scenario, we see from Table 2
that both the performance measure itself and the magnitude of the nonparametric derivative decrease
at a slower rate compared to the M/M/s case when the number of servers increases. Namely, the
mean waiting time decreases from 4.5 when s = 20 to 1.9 when s = 60, while the nonparametric
derivative decreases from 36.0 to 24.0. Correspondingly, the relative impact increases at a slower
rate, from 8.0 to 12.7. This illustrates that the effect of model misspecification has a bigger absolute
impact, but a smaller relative impact, on the waiting time in this large-scale G/G system than in the
Markovian counterpart. In real applications, the magnitude of $\eta$ is chosen to capture the statistical
uncertainty of the input model from past data.

Let us explain our estimation procedure in more detail. Note first that our performance
measure of interest depends on both the interarrival and service times, but arrivals are auxiliary
variables and hence the cost function can be written as $E_0[h(\mathbf{X}_T,\mathbf{Y}_T) \mid \mathbf{X}_T]$, where $\mathbf{X}_T$ denotes the
sequence of service times and $\mathbf{Y}_T$ the interarrival times (see Section 3). Second, note also that
Assumption 3 is easily satisfied since the cost function $E_0[h(\mathbf{X}_T,\mathbf{Y}_T) \mid \mathbf{X}_T] \le \sum_{t=1}^T (E_0[Y_t] + X_t)$
and each $E_0[Y_t] + X_t$ has exponential moment. Moreover, Assumption 4 is trivially verified by our
computation that demonstrates $g(X)$ is not a constant. Hence the assumptions in Theorem 2 are
valid.

Figure 2 (Mean Waiting Time in G/G/s Queue): First order approximation of the deviation from
benchmark performance measure under different input model discrepancies in terms of KL divergence

Number of    Benchmark performance measure    Nonparametric derivative      Relative
servers      Mean      C.I.                   Mean       C.I.               impact
20           4.522     (4.427, 4.616)         36.012     (33.918, 38.106)    7.965
30           3.816     (3.725, 3.906)         31.187     (29.132, 33.242)    8.174
40           3.062     (2.981, 3.142)         29.125     (27.655, 30.596)    9.512
50           2.796     (2.718, 2.874)         26.770     (24.958, 28.581)    9.574
60           1.888     (1.820, 1.956)         24.044     (21.411, 26.677)   12.737

Table 2: Simulation results for performance measures and first order nonparametric derivatives for
the mean waiting time of the 100-th customer in G/G/s systems with different server capacities

From discussion point 1 in Section 3.2, we can write $\mathrm{Var}_0(g(X)) = \mathrm{Var}_0(E_0[S_h(\mathbf{X}_T,\mathbf{Y}_T) \mid X])$,
where $S_h(\cdot)$ is the symmetrization sum over $\mathbf{X}_T$ defined in (14). This quantity is therefore in the
form of the variance of a conditional expectation, and we adopt an unbiased estimator from [50]
that is based on analysis-of-variance (ANOVA) to estimate this quantity. This estimator takes the
following form. For convenience, denote $H := S_h(\mathbf{X}_T, \mathbf{Y}_T)$, and we want to simulate $\mathrm{Var}_0(E[H \mid X])$.
The idea is to do a nested simulation by simulating $X_k$, $k = 1, \ldots, K$, and then, given each $X_k$,
simulating $H_{kj}$, $j = 1, \ldots, n$. An unbiased estimator is

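$$\hat{\sigma}_M^2 = \frac{1}{K-1}\sum_{k=1}^K (\bar{H}_k - \bar{\bar{H}})^2 - \frac{1}{n}\hat{\sigma}_\epsilon^2 \tag{67}$$

where

$$\hat{\sigma}_\epsilon^2 = \frac{1}{K(n-1)}\sum_{k=1}^K\sum_{j=1}^n (H_{kj} - \bar{H}_k)^2, \qquad \bar{H}_k = \frac{1}{n}\sum_{j=1}^n H_{kj}, \qquad \bar{\bar{H}} = \frac{1}{K}\sum_{k=1}^K \bar{H}_k.$$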



Next, to obtain a consistent estimate and confidence interval for $\sqrt{2\mathrm{Var}_0(g(X))}$, we use the
delta method and the technique of sectioning (see, for example, §III in [3]). The sampling strategy
is as follows:

1. Repeat the following $N$ times:

(a) Simulate $K$ samples of $X$, say $X_k = x_k$, $k = 1, \ldots, K$.

(b) For each realized $x_k$, simulate $n$ samples of $H$ given $X = x_k$.

(c) Calculate $\hat{\sigma}_M^2$ using (67).

2. The above procedure generates $N$ estimators $\hat{\sigma}_M^2$. Call them $Z_l$, $l = 1, \ldots, N$. The final
point estimator is $\sqrt{2\bar{Z}}$, where $\bar{Z} = (1/N)\sum_{l=1}^N Z_l$, and the $1 - \alpha$ confidence interval is
$\sqrt{2} \times \bigl(\sqrt{\bar{Z}} \pm (\hat{\sigma}/(2\sqrt{\bar{Z}}))\,t_{1-\alpha/2}/\sqrt{N}\bigr)$ where $\hat{\sigma}^2 = 1/(N-1)\sum_{l=1}^N (Z_l - \bar{Z})^2$ and $t_{1-\alpha/2}$ is the
$1 - \alpha/2$ percentile of the t-distribution with $N - 1$ degrees of freedom.

This gives a consistent estimate for $\sqrt{2\mathrm{Var}_0(g(X))}$ and an asymptotically valid confidence
interval. In our implementation we choose $K = 30$, $n = 10$ and $N = 20$. The choice of the inner
loop sample size $n$ is based on a pilot estimation (see §4 in [50]).
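The nested simulation and sectioning steps above can be sketched as follows. This is a minimal illustration on a toy model rather than the queueing system of this section: the samplers `sample_x` and `sample_h_given_x` are hypothetical stand-ins (here $X \sim \mathrm{Exp}(1)$ and $H = X + \text{noise}$, so that $\mathrm{Var}(E[H \mid X]) = 1$), and the t-quantile for 19 degrees of freedom is hard-coded because the Python standard library has no t-distribution.

```python
import math
import random
import statistics

def anova_estimate(sample_x, sample_h_given_x, K, n, rng):
    """ANOVA-type estimate of Var(E[H | X]) following the form of (67)."""
    group_means, within_ss = [], 0.0
    for _ in range(K):
        x = sample_x(rng)
        hs = [sample_h_given_x(x, rng) for _ in range(n)]
        m = sum(hs) / n
        group_means.append(m)
        within_ss += sum((v - m) ** 2 for v in hs)
    grand = sum(group_means) / K
    between = sum((m - grand) ** 2 for m in group_means) / (K - 1)
    sigma_eps2 = within_ss / (K * (n - 1))
    return between - sigma_eps2 / n           # bias correction for inner noise

def sectioning(sample_x, sample_h_given_x, K, n, N, t_quantile, rng):
    """Point estimate and delta-method interval for sqrt(2 Var(E[H | X]))."""
    z = [anova_estimate(sample_x, sample_h_given_x, K, n, rng) for _ in range(N)]
    z_bar = statistics.mean(z)
    s = statistics.stdev(z)                   # sample std with N - 1 in the denominator
    point = math.sqrt(2.0 * z_bar)
    half = math.sqrt(2.0) * (s / (2.0 * math.sqrt(z_bar))) * t_quantile / math.sqrt(N)
    return point, (point - half, point + half)

# Toy model (assumption): X ~ Exp(1), H | X = X + standard normal noise.
rng = random.Random(0)
point, (lo, hi) = sectioning(
    sample_x=lambda r: r.expovariate(1.0),
    sample_h_given_x=lambda x, r: x + r.gauss(0.0, 1.0),
    K=30, n=10, N=20,
    t_quantile=2.093,                         # 97.5% point of t with 19 d.o.f.
    rng=rng,
)
```

In this toy model the target is $\sqrt{2}$, so `point` should land near 1.41. In an actual application the two samplers would be replaced by simulation of the service time of one customer and of the symmetrized cost given that service time.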

We envision that sensitivity assessment of this sort can be applied across vast areas in engineering 
and management applications. As can be seen from the above example, the toolkit is applicable 
as long as the system can be simulated, and is methodologically identical for any input models, 
which provides great flexibility in applications. In inventory and revenue management for instance, 
it can be used to stress-test existing stocking or pricing policies and to potentially help redesign 
new policies that are robust to distributional uncertainty in demand and operation models. In 
transportation and other large-scale systems, it can be used to identify vulnerable system structures 
or components. A similar line of analysis can be applied to portfolio and risk management to help with
investment and trading decisions. The full development of these applications shall be considered in 
future research. 



8 Discussions 

We close this paper by discussing a few aspects of our problem and further work. 
8.1 Other Statistical Distances 

We point out that KL divergence is not the only choice for developing nonparametric derivatives. It
is, however, a natural choice and offers some advantages compared to other distances: it leads to the
favorable exponential tilting form of the involved optimal change of measure $L^*$, which also arises
in large deviations and rare-event problems for light-tailed systems [11]. It also has an entropy
interpretation, and has been used in model calibration and derivative security valuation in the
finance context (see, for example, [19]). The structurally neat form of the exponential change of
measure appears to streamline the development of contraction maps and the involved convergence
results shown in this paper.

Using other statistical distances will change certain aspects in the form of nonparametric derivatives
and the scaling of $\eta$ in the expansions. This arises from differences in the local behaviors of
different statistical distances. As a comparison, for example, the $\chi^2$-distance (or Euclidean distance),
given by $\chi^2(P_f, P_0) = E_0(L - 1)^2 = E_0[L^2] - 1$, will lead to an optimal change of measure satisfying
the linear-type contraction map $L(x) = 1 + (g^L(x) - E_0[g^L(X)])/(2\alpha)$ in Proposition 2. On the
other hand, the empirical likelihood or the "backward" entropy [42], given by $E_0[\log L^{-1}]$, will give
a reciprocal linear form $L(x) = 1/(\lambda - g^L(x)/\alpha)$ where $\lambda$ is a constant that makes $E_0[L] = 1$. Such
changes of measure will result in different coefficients and scaling in the expansions of the involved
optimal values. Nevertheless, it appears that the "symmetrization" functions $g(x)$ and $G(x,y)$ and
the cumulant forms come up in the nonparametric derivatives universally for many members in the
Csiszár $f$-divergence class (§2 in [51]). This phenomenon is intuitively consistent with results from
von Mises' theory on the form of the asymptotic variance for statistical functions (§6 in [M]). We
shall leave the thorough comparison of various statistical distances to future work, but shall point
out that the requirement of finite exponential moment on the cost function, which by far is the most
important assumption in our formulation, can potentially be removed by using other distances such
as the $\alpha$-divergence [13, 22].

8.2 Simulation Output Analysis 

The nonparametric derivatives developed in this paper can potentially be applied to the construction
of confidence intervals subject to input model uncertainty. When the input model is estimated
from data, the construction of confidence intervals for simulation outputs must account for the statistical
deviations from both simulation and data collection. This is known as trace-driven or historical
simulation (§2.3.2 in [40]). To use the notation in this paper, suppose one is interested in estimating
$E_0[h(\mathbf{X})]$ via simulation of size $n$. The exact distribution $P_0$ for the input $\mathbf{X}$ is unknown, but an
estimated distribution, call it $P_f$, is inferred from collected data of size $m$ (the form of collected
data that leads to $P_f$ can vary from case to case). Using $P_f$, the target becomes $E_f[h(\mathbf{X})]$, and the simulation
yields an estimate $\hat{E}_f[h(\mathbf{X})]$. The error involved is

$$\hat{E}_f[h(\mathbf{X})] - E_0[h(\mathbf{X})] = \bigl(\hat{E}_f[h(\mathbf{X})] - E_f[h(\mathbf{X})]\bigr) + \bigl(E_f[h(\mathbf{X})] - E_0[h(\mathbf{X})]\bigr)$$

where the first term is the error due to simulation and the second due to input uncertainty. Typically,
then, the confidence interval for $E_0[h(\mathbf{X})]$ is $\hat{E}_f[h(\mathbf{X})] \pm z_{1-\alpha/2}\sqrt{\sigma_1^2/n + \sigma_2^2/m}$ where $z_{1-\alpha/2}$ is
the $(1 - \alpha/2)$-th percentile of the normal distribution. While the simulation variance $\sigma_1^2$ can be easily
estimated, the input variance $\sigma_2^2$ is more involved, and sectioning (§III in [3]) can be used to estimate
this value in some situations.

Our nonparametric derivatives provide an alternate approach, by using the result that $|E_f[h(\mathbf{X})] -
E_0[h(\mathbf{X})]| \le \sqrt{2\mathrm{Var}_0(g(X))\,\eta} + O(\eta)$, for any $P_f$ that satisfies $D(P_f \| P_0) \le \eta$. A justified choice of
$\eta$, which typically scales with the amount of data, can help to construct a valid confidence interval by
combining with the estimation of $\mathrm{Var}_0(g(X))$. One possible way to do so is via empirical likelihood:
since KL divergence is in the Cressie-Read family, a variant of the EL method (§3 in [42]) stipulates
that, under certain regularity conditions, the associated profile likelihood is asymptotically
$\chi^2$-distributed. This will give valid critical values for pre-specified false positive probabilities. Such
development shall be left for future research.

8.3 Other Formulations 

The formulation in this paper is the most stringent among a range of possible formulations in
analyzing model misspecification. Namely, i.i.d. assumptions are held intact and uncertainty is
only allowed on each identical individual distribution. On the other end of the spectrum is a
formulation that allows freedom over both dependency and distributions among different random
components (in a manner that is adapted to the filtration). In this case, the constraints on model
misspecification in terms of KL divergence are given as $E_0[L_t\log L_t \mid \mathcal{F}_{t-1}] \le \eta$ for a sequence of
likelihood ratios $L_t$ on time steps $t = 1, \ldots, T$. When the cost function is linear, this is equivalent
to a multi-stage robust control problem. General studies on the formulations in between these two
extremes appear to be open.
A Technical Proofs

A.1 Proof of Proposition 2 for General Time Horizon T

We shall prove the several lemmas that lead to the proof of Proposition 2 for general time horizon
$T$ in Section 5.3. For convenience let $C > 0$ be a generic constant.

Proof of Lemma 5. We prove the statement point-by-point regarding the operator defined in (57). For
convenience, denote $\bar{S}_h(X,\mathbf{X}_{T-1}) = \bar{S}_h(X, X_1, X_2, \ldots, X_{T-1})$, where $\bar{S}_h$ is defined in (54), and $\underline{L}_{T-1} =
\prod_{t=1}^{T-1} L_t(X_t)$. Also, denote $\beta = 1/\alpha > 0$, so $\beta \to 0$ is equivalent to $\alpha \to \infty$.



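Well-definedness and closedness: Recall the definition of $K$ in (55), which can be written as

$$K(\mathbf{L})(x) = \frac{e^{\beta E_0[\bar{S}_h(X,\mathbf{X}_{T-1})\underline{L}_{T-1} \mid X = x]/(T-1)!}}{E_0[e^{\beta E_0[\bar{S}_h(X,\mathbf{X}_{T-1})\underline{L}_{T-1} \mid X]/(T-1)!}]}$$

for any $\mathbf{L} = (L_1, L_2, \ldots, L_{T-1}) \in \mathcal{L}(M)^{T-1}$. We shall show that, for any $\mathbf{L} \in \mathcal{L}(M)^{T-1}$, we have
$0 < E_0[e^{\beta E_0[\bar{S}_h(X,\mathbf{X}_{T-1})\underline{L}_{T-1} \mid X]/(T-1)!}] < \infty$ and that $K(\mathbf{L}) \in \mathcal{L}(M)$. This will imply that, starting
from any $L_1, L_2, \ldots, L_{T-1} \in \mathcal{L}(M)$, we get a well-defined operator $K$ and that $\tilde{L}_1, \tilde{L}_2, \ldots, \tilde{L}_{T-1}$
defined in (56) all remain in $\mathcal{L}(M)$. We then conclude that the operator in (57) is both well-defined and closed in
$\mathcal{L}(M)^{T-1}$ by the definition in (57).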



Now suppose Li, L2, . . . , Lt-i G C{M). Consider 



= EQ[e^^^^^]e^^'^-^ Eo[K(x)Lt(x)] 

< 00. (68) 
This also implies that e'^^o[Sh(x,XT-^i)LT-im/iT-iy- < ^ a.s.. Similarly, 

^^[g/3Eo[5ax,XT-i)LT_i|X]/(T-l)!] > ^^[g-M(X)]g-/3(T-l)M > q_ (gg) 

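Hence $K$ is well-defined. To show closedness, consider

\begin{align*}
E_0[\Lambda(X)K(\mathbf{L})(X)]
&= E_0\left[\Lambda(X)\,\frac{e^{\beta E_0[S_h(X,\mathbf{X}_{T-1})\underline{L}_{T-1} \mid X]/(T-1)!}}{E_0\big[e^{\beta E_0[S_h(X,\mathbf{X}_{T-1})\underline{L}_{T-1} \mid X]/(T-1)!}\big]}\right] \\
&\le \frac{E_0[\Lambda(X)e^{\beta\Lambda(X)}]\, e^{\beta(T-1)M}}{E_0[e^{-\beta\Lambda(X)}]\, e^{-\beta(T-1)M}}. \tag{70}
\end{align*}

Since $E_0[\Lambda(X)e^{\beta\Lambda(X)}] \to E_0[\Lambda(X)]$ and $E_0[e^{-\beta\Lambda(X)}] \to 1$ as $\beta \to 0$, (70) is bounded by $M$ for small
enough $\beta$, if we choose $M > E_0[\Lambda(X)]$. Hence $K$ is closed in $\mathcal{L}(M)$.

By recursing using (56), we get that $\boldsymbol{\mathcal{K}}$ is well-defined, and that for any $\mathbf{L} = (L_1, \ldots, L_{T-1}) \in
\mathcal{L}(M)^{T-1}$, we have $\max_{t=1,\ldots,T-1} E_0[\Lambda(X)\tilde{L}_t(X)] \le M$, and so $\boldsymbol{\mathcal{K}}$ is closed in $\mathcal{L}(M)^{T-1}$.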
Contraction: Consider any $\mathbf{L} = (L_1, \ldots, L_{T-1})$, $\mathbf{L}' = (L_1', \ldots, L_{T-1}') \in \mathcal{L}(M)^{T-1}$, and write for brevity
$a(X) := \beta E_0[S_h(X,\mathbf{X}_{T-1})\underline{L}_{T-1} \mid X]/(T-1)!$ and $a'(X) := \beta E_0[S_h(X,\mathbf{X}_{T-1})\underline{L}'_{T-1} \mid X]/(T-1)!$. Then

\begin{align*}
& E_0\big[(1+\Lambda(X))\,|K(\mathbf{L})(X) - K(\mathbf{L}')(X)|\big] \\
&= E_0\left[(1+\Lambda(X))\left|\frac{e^{a(X)}}{E_0[e^{a(X)}]} - \frac{e^{a'(X)}}{E_0[e^{a'(X)}]}\right|\right] \\
&= E_0\left[(1+\Lambda(X))\left|\frac{1}{\xi_2}\big(e^{a(X)} - e^{a'(X)}\big) - \frac{\xi_1}{\xi_2^2}\big(E_0[e^{a(X)}] - E_0[e^{a'(X)}]\big)\right|\right] \tag{71}
\end{align*}

by the mean value theorem, where $(\xi_1, \xi_2)$ lies in the line segment between $(e^{a(X)}, E_0[e^{a(X)}])$ and
$(e^{a'(X)}, E_0[e^{a'(X)}])$.

By (69), we have $\xi_2 \ge 1 - \epsilon$ for some small $\epsilon > 0$, when $\beta$ is small enough. Moreover, $\xi_1 \le
e^{\beta(\Lambda(X) + (T-1)M)}$. Hence (71) is less than or equal to

\begin{align*}
& E_0\left[(1+\Lambda(X))\left(\sup\left|\frac{1}{\xi_2}\right|\,\big|e^{a(X)} - e^{a'(X)}\big| + \sup\left|\frac{\xi_1}{\xi_2^2}\right|\,\big|E_0[e^{a(X)}] - E_0[e^{a'(X)}]\big|\right)\right] \\
&\le \frac{1}{1-\epsilon}\,E_0\big[(1+\Lambda(X))\,|e^{a(X)} - e^{a'(X)}|\big] + \frac{e^{\beta(T-1)M}}{(1-\epsilon)^2}\,E_0\big[(1+\Lambda(X))e^{\beta\Lambda(X)}\big]\,\big|E_0[e^{a(X)}] - E_0[e^{a'(X)}]\big| \\
&\le C\,E_0\big[(1+\Lambda(X))\,|e^{a(X)} - e^{a'(X)}|\big] \\
&\le C\beta\,E_0\left[(1+\Lambda(X))\,e^{2\beta\Lambda(X)}\,\frac{\big|E_0[S_h(X,\mathbf{X}_{T-1})\underline{L}_{T-1} \mid X] - E_0[S_h(X,\mathbf{X}_{T-1})\underline{L}'_{T-1} \mid X]\big|}{(T-1)!}\right] \quad \text{by the mean value theorem again} \\
&\le C\beta\,E_0\left[(1+\Lambda(X))\,e^{2\beta\Lambda(X)}\,\frac{|S_h(X,\mathbf{X}_{T-1})|}{(T-1)!}\,|\underline{L}_{T-1} - \underline{L}'_{T-1}|\right] \\
&\le C\beta\,E_0\left[(1+\Lambda(X))\,e^{2\beta\Lambda(X)}\left(\Lambda(X) + \sum_{t=1}^{T-1}\Lambda(X_t)\right)|\underline{L}_{T-1} - \underline{L}'_{T-1}|\right] \\
&\le C\beta\left(E_0\big[(1+\Lambda(X))^2 e^{2\beta\Lambda(X)}\big]\,E_0|\underline{L}_{T-1} - \underline{L}'_{T-1}| + E_0\big[(1+\Lambda(X))e^{2\beta\Lambda(X)}\big]\,E_0\left[\sum_{t=1}^{T-1}\Lambda(X_t)\,|\underline{L}_{T-1} - \underline{L}'_{T-1}|\right]\right) \tag{72}
\end{align*}
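when $\beta$ is small enough (or $\alpha$ is large enough). Now note that

\[
|\underline{L}_{T-1} - \underline{L}'_{T-1}| \le \sum_{s=1}^{T-1} \underline{\tilde{L}}^{s}_{T-1}\,|L_s(X_s) - L_s'(X_s)|
\]

where each $\underline{\tilde{L}}^{s}_{T-1}$ is a product of either one of $L(X_r)$ or $L'(X_r)$ for $r = 1, \ldots, T-1$, $r \ne s$. Hence
(72) is less than or equal to

\begin{align*}
C\beta\Bigg( & E_0\big[(1+\Lambda(X))^2 e^{2\beta\Lambda(X)}\big] \sum_{s=1}^{T-1} E_0|L_s(X) - L_s'(X)| \\
& + E_0\big[(1+\Lambda(X))e^{2\beta\Lambda(X)}\big] \sum_{t=1}^{T-1}\Bigg(E_0\big[\Lambda(X)|L_t(X) - L_t'(X)|\big] + E_0[\Lambda(X)L(X)] \sum_{\substack{s=1,\ldots,T-1 \\ s \ne t}} E_0|L_s(X) - L_s'(X)|\Bigg)\Bigg). \tag{73}
\end{align*}

Now for convenience denote $y_t = E_0[(1+\Lambda(X))|L_t(X) - L_t'(X)|]$ and $\tilde{y}_t = E_0[(1+\Lambda(X))|\tilde{L}_t(X) - \tilde{L}_t'(X)|]$,
where $\tilde{L}_t$ is defined in (56). Also denote $\mathbf{y} = (y_1, \ldots, y_{T-1})'$ and $\tilde{\mathbf{y}} = (\tilde{y}_1, \ldots, \tilde{y}_{T-1})'$. Then
(73) gives $\tilde{y}_1 \le a\mathbf{1}'\mathbf{y}$ for some $a := a(\beta) = O(\beta)$ as $\beta \to 0$, where $\mathbf{1}$ denotes the $(T-1)$-dimensional
vector of constant 1. Then (56) implies that

\begin{align*}
\tilde{y}_1 &\le a\mathbf{1}'(y_1, y_2, \ldots, y_{T-1})' \\
\tilde{y}_2 &\le a\mathbf{1}'(\tilde{y}_1, y_2, y_3, \ldots, y_{T-1})' \\
&\ \,\vdots \\
\tilde{y}_{T-1} &\le a\mathbf{1}'(\tilde{y}_1, \ldots, \tilde{y}_{T-2}, y_{T-1})'.
\end{align*}

Hence

\[
\max_{t=1,\ldots,T-1} \tilde{y}_t \le \max_{t=1,\ldots,T-1} \sum_{s=1}^{T-1} A_{ts}\,y_s
\]

where $A_{ts} := A_{ts}(\beta)$ are constants that go to 0 as $\beta \to 0$. Therefore, when $\beta$ is small enough,
$d(\boldsymbol{\mathcal{K}}(\mathbf{L}), \boldsymbol{\mathcal{K}}(\mathbf{L}')) \le w\,d(\mathbf{L}, \mathbf{L}')$ for some $0 < w < 1$, and $\boldsymbol{\mathcal{K}}$ is a contraction. Moreover, as
a consequence, starting from any initial value $\mathbf{L}^{(1)} = \mathbf{L} \in \mathcal{L}(M)^{T-1}$, the recursion $\mathbf{L}^{(k+1)} = \boldsymbol{\mathcal{K}}(\mathbf{L}^{(k)})$
satisfies $\mathbf{L}^{(k)} \to \mathbf{L}^*$, where $\mathbf{L}^*$ is the fixed point of $\boldsymbol{\mathcal{K}}$.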
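The contraction property can be illustrated numerically. The following is a minimal sketch, not the construction used in the proof: it assumes $T = 2$, a finite state space, a symmetrized cost of the form $S_h(x, x') = h(x, x') + h(x', x)$, and it measures distances in the plain $L_1$ norm under $P_0$ rather than in $\|\cdot\|_\Lambda$. The probabilities `p`, the cost table `h` and the value of `beta` are hypothetical.

```python
import numpy as np

def fixed_point_iteration(p, S, beta, n_iter=50):
    """Iterate L <- exp(beta * E0[S(x, X') L(X')]) / normaliser on a finite state space.

    p    : baseline probabilities (P0) over n states
    S    : n-by-n symmetric array, a stand-in for the symmetrized cost S_h(x, x')
    beta : 1/alpha, assumed small so that the map is a contraction
    Returns the final likelihood ratio and the distances between successive iterates.
    """
    L = np.ones_like(p)                 # start from L = 1 (no change of measure)
    gaps = []
    for _ in range(n_iter):
        inner = S @ (p * L)             # E0[S(x, X') L(X')] for each state x
        new_L = np.exp(beta * inner)
        new_L /= np.sum(p * new_L)      # normalise so that E0[L] = 1
        gaps.append(np.sum(p * np.abs(new_L - L)))   # L1 distance under P0
        L = new_L
    return L, gaps

# Hypothetical three-state example.
p = np.array([0.2, 0.5, 0.3])
h = np.array([[0.0, 1.0, 2.0],
              [0.5, 1.0, 0.0],
              [2.0, 0.0, 1.0]])
S = h + h.T                             # assumed form of the symmetrization for T = 2
L_star, gaps = fixed_point_iteration(p, S, beta=0.05)
```

In this toy setting the entries of `gaps` shrink geometrically, which is the behaviour the contraction argument predicts for small $\beta$.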

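Identical Components: It remains to show that all components of $\mathbf{L}^*$ are the same. Denote $\mathbf{L}^* =
(L_1^*, \ldots, L_{T-1}^*)$. By definition $\boldsymbol{\mathcal{K}}(\mathbf{L}^*) = \mathbf{L}^*$. Using (56), we have

\begin{align*}
\tilde{L}_1 &= K(L_1^*, \ldots, L_{T-1}^*) = L_1^* \\
\tilde{L}_2 &= K(\tilde{L}_1, L_2^*, \ldots, L_{T-1}^*) = K(L_1^*, L_2^*, \ldots, L_{T-1}^*) = L_2^* \\
\tilde{L}_3 &= K(\tilde{L}_1, \tilde{L}_2, L_3^*, \ldots, L_{T-1}^*) = K(L_1^*, L_2^*, L_3^*, \ldots, L_{T-1}^*) = L_3^* \\
&\ \,\vdots \\
\tilde{L}_{T-1} &= K(\tilde{L}_1, \ldots, \tilde{L}_{T-2}, L_{T-1}^*) = K(L_1^*, \ldots, L_{T-2}^*, L_{T-1}^*) = L_{T-1}^*.
\end{align*}

Hence $L_1^* = \cdots = L_{T-1}^* = \tilde{L}_1 = \cdots = \tilde{L}_{T-1} = K(L_1^*, \ldots, L_{T-1}^*)$. This concludes the lemma. □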
Proof of Corollary. Note that $\boldsymbol{\mathcal{K}}$ has a fixed point in $\mathcal{L}(M)^{T-1}$ that has all equal components.
Since convergence in $\mathcal{L}(M)^{T-1}$ implies convergence in each component in $\mathcal{L}(M)$ (in the $\|\cdot\|_\Lambda$-norm),
the result follows. □

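Proof of Lemma. The proof of this lemma follows the same idea as Lemma 3, with the conditioning
now imposed on the variable that corresponds to the smallest index of $L$ among the $T - 1$
candidates in consideration. More specifically, using the result in Proposition 1 and the symmetry
of $S_h$, we have

\begin{align*}
& E_0\left[S_h(\mathbf{X}_T)\prod_{t=1}^{T} L^{(k+t-1)}(X_t)\right] - \alpha(T-1)!\sum_{t=1}^{T} E_0[L^{(k+t-1)}\log L^{(k+t-1)}] \\
&= E_0\left[S_h(\mathbf{X}_T)\,L^{(k)}(X_1)\prod_{t=2}^{T} L^{(k+t-1)}(X_t)\right] - \alpha(T-1)!\,E_0[L^{(k)}\log L^{(k)}] - \alpha(T-1)!\sum_{t=2}^{T} E_0[L^{(k+t-1)}\log L^{(k+t-1)}] \\
&\le E_0\left[S_h(\mathbf{X}_T)\,L^{(k+T)}(X_1)\prod_{t=2}^{T} L^{(k+t-1)}(X_t)\right] - \alpha(T-1)!\,E_0[L^{(k+T)}\log L^{(k+T)}] - \alpha(T-1)!\sum_{t=2}^{T} E_0[L^{(k+t-1)}\log L^{(k+t-1)}] \\
&= E_0\left[S_h(\mathbf{X}_T)\prod_{t=1}^{T} L^{(k+t)}(X_t)\right] - \alpha(T-1)!\sum_{t=1}^{T} E_0[L^{(k+t)}\log L^{(k+t)}].
\end{align*}

□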

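Proof of Lemma. The proof follows closely that of Lemma 4. We consider convergence of the first
and second terms of (59) separately. For the first term, consider

\begin{align*}
& \left|E_0\left[S_h(\mathbf{X}_T)\prod_{t=1}^{T} L^{(k+t-1)}(X_t)\right] - E_0\left[S_h(\mathbf{X}_T)\prod_{t=1}^{T} L^*(X_t)\right]\right| \\
&\le E_0\left[|S_h(\mathbf{X}_T)|\left|\prod_{t=1}^{T} L^{(k+t-1)}(X_t) - \prod_{t=1}^{T} L^*(X_t)\right|\right] \\
&\le (T-1)!\,E_0\left[\sum_{t=1}^{T}\Lambda(X_t)\sum_{s=1}^{T}\underline{\tilde{L}}^{s}_{T}\,|L^{(k+s-1)}(X_s) - L^*(X_s)|\right] \\
&\le C\sum_{t=1}^{T} E_0\big[(1+\Lambda(X))\,|L^{(k+t-1)}(X) - L^*(X)|\big] \tag{74} \\
&\to 0
\end{align*}

as $k \to \infty$, where each $\underline{\tilde{L}}^{s}_{T}$ is a product of either one of $L^{(k+r-1)}(X_r)$ or $L^*(X_r)$ for $r = 1, \ldots, T$, $r \ne s$.
The convergence holds since $L^{(k)} \to L^*$ in the $\|\cdot\|_\Lambda$-norm by the corollary above.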
We now consider the second term in (59). Write for brevity
$a_k(X) := \beta E_0[S_h(X,\mathbf{X}_{T-1})\prod_{t=1}^{T-1} L^{(k+t-1)}(X_t) \mid X]/(T-1)!$ and
$a_*(X) := \beta E_0[S_h(X,\mathbf{X}_{T-1})\underline{L}^*_{T-1} \mid X]/(T-1)!$. By the recursion of $\boldsymbol{\mathcal{K}}$ we have, for $k \ge 1$,

\begin{align*}
& \big|E_0[L^{(k+T-1)}\log L^{(k+T-1)}] - E_0[L^*\log L^*]\big| \\
&= \left|\frac{\beta}{(T-1)!}E_0\left[S_h(\mathbf{X}_T)\prod_{t=1}^{T} L^{(k+t-1)}(X_t)\right] - \log E_0[e^{a_k(X)}] - \frac{\beta}{(T-1)!}E_0\left[S_h(\mathbf{X}_T)\prod_{t=1}^{T} L^*(X_t)\right] + \log E_0[e^{a_*(X)}]\right| \\
&\le \frac{\beta}{(T-1)!}\left|E_0\left[S_h(\mathbf{X}_T)\prod_{t=1}^{T} L^{(k+t-1)}(X_t)\right] - E_0\left[S_h(\mathbf{X}_T)\prod_{t=1}^{T} L^*(X_t)\right]\right| + \big|\log E_0[e^{a_k(X)}] - \log E_0[e^{a_*(X)}]\big|. \tag{75}
\end{align*}

The first term in (75) converges to 0 by the same argument as in (74). For the second term, we can
write, by the mean value theorem, that

\[
\big|\log E_0[e^{a_k(X)}] - \log E_0[e^{a_*(X)}]\big| = \frac{1}{\xi_1}\,\big|E_0[e^{a_k(X)}] - E_0[e^{a_*(X)}]\big| \tag{76}
\]

where $\xi_1$ lies between $E_0[e^{a_k(X)}]$ and $E_0[e^{a_*(X)}]$,
and hence $\xi_1 \ge E_0[e^{-\beta\Lambda(X)}]\,e^{-\beta(T-1)M} \ge 1 - \epsilon$ for some small $\epsilon > 0$, when $\beta$ is small enough.
Moreover,

\[
\big|E_0[e^{a_k(X)}] - E_0[e^{a_*(X)}]\big| \le \frac{\beta}{(T-1)!}\,E_0\left[e^{\beta\xi_2}\left|E_0\left[S_h(X,\mathbf{X}_{T-1})\left(\prod_{t=1}^{T-1} L^{(k+t-1)}(X_t) - \underline{L}^*_{T-1}\right)\,\Big|\, X\right]\right|\right]
\]

for some $\xi_2$ lying between $E_0[S_h(X,\mathbf{X}_{T-1})\prod_{t=1}^{T-1} L^{(k+t-1)}(X_t) \mid X]/(T-1)!$ and $E_0[S_h(X,\mathbf{X}_{T-1})\underline{L}^*_{T-1} \mid X]/(T-1)!$,
and hence $\xi_2 \le \Lambda(X) + (T-1)M$. So, much like the argument in proving Lemma 4, (76) is
less than or equal to

\[
C\beta\max_{1 \le t \le T-1} E_0\big[(1+\Lambda(X))\,|L^{(k+t-1)}(X) - L^*(X)|\big] \to 0
\]

as $k \to \infty$. This concludes the lemma. □

A.2 Proof of Theorem 2



We shall prove the asymptotics discussed in Section 5.4 in order to prove Theorem 2. Denote
$\beta = 1/\alpha^* > 0$, so $\beta$ is small when $\alpha^*$ is large. Then from (41) we have

\[
L^*(x) = \frac{e^{\beta g^{L^*}(x)}}{E_0[e^{\beta g^{L^*}(X)}]} \tag{77}
\]

where $g^{L^*}(x) = \sum_{t=1}^{T} g_t^{L^*}(x) = \sum_{t=1}^{T} E_0[h(\mathbf{X}_T)\prod_{1 \le r \le T,\, r \ne t} L^*(X_r) \mid X_t = x]$. Also recall that

\[
g(x) = \sum_{t=1}^{T} g_t(x) = \sum_{t=1}^{T} E_0[h(\mathbf{X}_T) \mid X_t = x]
\]

as defined earlier. Note that $E_0[g(X)] = T E_0[h(\mathbf{X}_T)]$. Furthermore, let us denote, for any $p \ge 1$,
$O(\beta^p) := O(\beta^p; x)$ as a deterministic function in $x$ such that $E_0[\Lambda(X_t)^q\,O(\beta^p; X_t)] = O(\beta^p)$ for any
$q \ge 1$ and $t = 1, \ldots, T$, when $\beta \to 0$. Finally, we also let $\psi_{L^*}(\beta) := \log E_0[e^{\beta g^{L^*}(X)}]$ for convenience.

We first give a quadratic approximation of $L^*$ as $\beta \to 0$ (equivalently $\alpha^* \to \infty$). Then we find
the relation between $\beta$ and $\eta$, which verifies the optimality condition given in Proposition 3. After
that we expand the objective value in terms of $\beta$, and hence $\eta$, to conclude Theorem 2.

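Asymptotic expansion of $L^*$: We shall obtain a quadratic approximation of $L^*$ by first getting
a first order approximation of $L^*$ and then iterating via the quantity $g^{L^*}$ to get to the second order.
Note that as the "logarithmic moment generating function" of $g^{L^*}(X)$,

\[
\psi_{L^*}(\beta) = \log E_0[e^{\beta g^{L^*}(X)}] = \beta E_0[g^{L^*}(X)] + \frac{\beta^2}{2}\kappa_2(g^{L^*}(X)) + \frac{\beta^3}{6}\kappa_3(g^{L^*}(X)) + O(\beta^4) \tag{78}
\]

where $\kappa_2(g^{L^*}(X)) := E_0[(g^{L^*}(X) - E_0[g^{L^*}(X)])^2]$ and $\kappa_3(g^{L^*}(X)) := E_0[(g^{L^*}(X) - E_0[g^{L^*}(X)])^3]$.
Using (77) and (78), and the finiteness of the exponential moment of $g^{L^*}(X)$ guaranteed by a
calculation similar to (68), we have

\[
L^*(x) = e^{\beta g^{L^*}(x) - \psi_{L^*}(\beta)} = 1 + \beta\big(g^{L^*}(x) - E_0[g^{L^*}(X)]\big) + O(\beta^2). \tag{79}
\]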
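The cumulant expansion of a logarithmic moment generating function used in (78) can be checked numerically. The sketch below is only an illustration under simplifying assumptions: the random variable is a fixed discrete one (so it does not depend on $\beta$, unlike $g^{L^*}$), and the probabilities `p` and values `g` are hypothetical.

```python
import numpy as np

def log_mgf(p, g, beta):
    """Exact log E[exp(beta * g)] for a discrete random variable."""
    return np.log(np.sum(p * np.exp(beta * g)))

def cumulant_approx(p, g, beta):
    """Third-order expansion: beta*mean + beta^2*kappa_2/2 + beta^3*kappa_3/6."""
    mean = np.sum(p * g)
    k2 = np.sum(p * (g - mean) ** 2)    # second cumulant (variance)
    k3 = np.sum(p * (g - mean) ** 3)    # third cumulant (third central moment)
    return beta * mean + beta**2 * k2 / 2 + beta**3 * k3 / 6

# Hypothetical three-point distribution.
p = np.array([0.5, 0.3, 0.2])
g = np.array([0.0, 1.0, 3.0])
errors = [abs(log_mgf(p, g, b) - cumulant_approx(p, g, b)) for b in (0.1, 0.05)]
```

Halving `beta` should shrink the error by a factor of roughly $2^4$, consistent with a remainder of order $\beta^4$.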



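But notice that

\begin{align*}
g^{L^*}(x) &= \sum_{t=1}^{T} E_0\left[h(\mathbf{X}_T)\prod_{1 \le r \le T,\, r \ne t} L^*(X_r)\,\Big|\, X_t = x\right] \\
&= \sum_{t=1}^{T} E_0\left[h(\mathbf{X}_T)\prod_{1 \le r \le T,\, r \ne t}\big(1 + O(\beta; X_r)\big)\,\Big|\, X_t = x\right] \\
&= \sum_{t=1}^{T} E_0[h(\mathbf{X}_T) \mid X_t = x] + O(\beta) \\
&= g(x) + O(\beta)
\end{align*}

and hence $E_0[g^{L^*}(X)] = E_0[g(X)] + O(\beta)$. Consequently, from (79) we have

\[
L^*(x) = 1 + \beta\big(g(x) - E_0[g(X)]\big) + O(\beta^2). \tag{80}
\]

This gives a first order approximation of $L^*$. Using (80), we strengthen our approximation of $g^{L^*}$
to get

\begin{align*}
g^{L^*}(x) &= \sum_{t=1}^{T} E_0\left[h(\mathbf{X}_T)\prod_{1 \le r \le T,\, r \ne t}\big(1 + \beta(g(X_r) - E_0[g(X)]) + O(\beta^2)\big)\,\Big|\, X_t = x\right] \\
&= g(x) + \beta\sum_{t=1}^{T}\sum_{1 \le r \le T,\, r \ne t} E_0\big[h(\mathbf{X}_T)(g(X_r) - E_0[g(X)]) \mid X_t = x\big] + O(\beta^2) \\
&= g(x) + \beta W(x) + O(\beta^2) \tag{81}
\end{align*}

where we define $W(x) := \sum_{t=1}^{T}\sum_{1 \le r \le T,\, r \ne t} E_0[h(\mathbf{X}_T)(g(X_r) - E_0[g(X)]) \mid X_t = x]$. With (81), and
using (78) again, we then strengthen the approximation of $L^*$ to get

\begin{align*}
L^*(x) &= e^{\beta g^{L^*}(x) - \psi_{L^*}(\beta)} = e^{\beta(g^{L^*}(x) - E_0[g^{L^*}(X)]) - \frac{\beta^2}{2}E_0[(g^{L^*}(X) - E_0[g^{L^*}(X)])^2] + O(\beta^3)} \\
&= 1 + \beta\big(g^{L^*}(x) - E_0[g^{L^*}(X)]\big) + \frac{\beta^2}{2}\Big[\big(g^{L^*}(x) - E_0[g^{L^*}(X)]\big)^2 - E_0\big[(g^{L^*}(X) - E_0[g^{L^*}(X)])^2\big]\Big] + O(\beta^3) \\
&= 1 + \beta\big(g(x) - E_0[g(X)]\big) + \beta^2\Big[W(x) - E_0[W(X)] + \frac{1}{2}\Big(\big(g(x) - E_0[g(X)]\big)^2 - E_0\big[(g(X) - E_0[g(X)])^2\big]\Big)\Big] + O(\beta^3) \\
&= 1 + \beta\big(g(x) - E_0[g(X)]\big) + \beta^2 V(x) + O(\beta^3) \tag{82}
\end{align*}

where we define $V(x) := W(x) - E_0[W(X)] + \frac{1}{2}\big((g(x) - E_0[g(X)])^2 - E_0[(g(X) - E_0[g(X)])^2]\big)$.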
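As a sanity check on the quadratic approximation (82), the following sketch looks at the special case $T = 1$ only. In that case the double sum defining $W$ is empty, so $W = 0$, $g^{L^*} = g$, and $L^*$ is an ordinary exponential tilting that can be computed exactly. The discrete distribution (`p`, `g`) is hypothetical, and the check says nothing about $T > 1$, where the fixed point has to be found iteratively.

```python
import numpy as np

def exact_tilt(p, g, beta):
    """L*(x) = exp(beta * g(x)) / E0[exp(beta * g(X))], the T = 1 fixed point."""
    w = np.exp(beta * g)
    return w / np.sum(p * w)

def second_order(p, g, beta):
    """1 + beta*(g - E0 g) + beta^2 * V, with V evaluated for W = 0."""
    c = g - np.sum(p * g)                     # centred g
    V = 0.5 * (c**2 - np.sum(p * c**2))       # V(x) when W vanishes
    return 1 + beta * c + beta**2 * V

def max_residual(p, g, beta):
    """Largest pointwise gap between the exact tilt and its quadratic approximation."""
    return np.max(np.abs(exact_tilt(p, g, beta) - second_order(p, g, beta)))

# Hypothetical three-point distribution.
p = np.array([0.5, 0.3, 0.2])
g = np.array([0.0, 1.0, 3.0])
ratio = max_residual(p, g, 0.02) / max_residual(p, g, 0.01)
```

A value of `ratio` close to 8 is what a remainder of order $\beta^3$ would produce when $\beta$ is halved.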



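Relation between $\beta$ and $\eta$: By substituting $L^*$ depicted in (77) into $\eta = E_0[L^*\log L^*]$, we have

\[
\eta = E_0[L^*\log L^*] = \beta E_0[g^{L^*}(X)L^*(X)] - \log E_0[e^{\beta g^{L^*}(X)}] = \beta T E_0[h(\mathbf{X}_T)\underline{L}^*_T] - \psi_{L^*}(\beta). \tag{83}
\]

Using (78), we can write (83) as

\begin{align*}
& \beta T E_0[h(\mathbf{X}_T)\underline{L}^*_T] - \beta E_0[g^{L^*}(X)] - \frac{\beta^2}{2}\kappa_2(g^{L^*}(X)) - \frac{\beta^3}{6}\kappa_3(g^{L^*}(X)) + O(\beta^4) \\
&= \beta\sum_{t=1}^{T} E_0\left[h(\mathbf{X}_T)\prod_{1 \le r \le T,\, r \ne t} L^*(X_r)\,\big(L^*(X_t) - 1\big)\right] - \frac{\beta^2}{2}\kappa_2(g^{L^*}(X)) - \frac{\beta^3}{6}\kappa_3(g^{L^*}(X)) + O(\beta^4). \tag{84}
\end{align*}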
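We analyze (84) term by term. For the first term, using (82), we have

\begin{align*}
& \beta\sum_{t=1}^{T} E_0\left[h(\mathbf{X}_T)\prod_{1 \le r \le T,\, r \ne t} L^*(X_r)\,\big(L^*(X_t) - 1\big)\right] \\
&= \beta\sum_{t=1}^{T} E_0\left[h(\mathbf{X}_T)\prod_{1 \le r \le T,\, r \ne t}\big(1 + \beta(g(X_r) - E_0[g(X)]) + \beta^2 V(X_r) + O(\beta^3)\big)\big(\beta(g(X_t) - E_0[g(X)]) + \beta^2 V(X_t) + O(\beta^3)\big)\right] \\
&= \beta^2\sum_{t=1}^{T} E_0\big[h(\mathbf{X}_T)(g(X_t) - E_0[g(X)])\big] \\
&\quad + \beta^3\left[\sum_{t=1}^{T}\sum_{1 \le r \le T,\, r \ne t} E_0\big[h(\mathbf{X}_T)(g(X_r) - E_0[g(X)])(g(X_t) - E_0[g(X)])\big] + \sum_{t=1}^{T} E_0[h(\mathbf{X}_T)V(X_t)]\right] + O(\beta^4) \\
&= \beta^2\,\mathrm{Var}_0(g(X)) + \beta^3\big[\nu + E_0[g(X)V(X)]\big] + O(\beta^4) \tag{85}
\end{align*}

where $\nu$ is defined in (11). The last equality follows since

\begin{align*}
\sum_{t=1}^{T} E_0\big[h(\mathbf{X}_T)(g(X_t) - E_0[g(X)])\big] &= \sum_{t=1}^{T} E_0\big[E_0[h(\mathbf{X}_T) \mid X_t](g(X_t) - E_0[g(X)])\big] \\
&= \sum_{t=1}^{T} E_0\big[g_t(X)(g(X) - E_0[g(X)])\big] = E_0\big[g(X)(g(X) - E_0[g(X)])\big] = \mathrm{Var}_0(g(X)),
\end{align*}

\begin{align*}
& \sum_{t=1}^{T}\sum_{1 \le r \le T,\, r \ne t} E_0\big[h(\mathbf{X}_T)(g(X_r) - E_0[g(X)])(g(X_t) - E_0[g(X)])\big] \\
&= \sum_{t=1}^{T}\sum_{1 \le r \le T,\, r \ne t} E_0\big[E_0[h(\mathbf{X}_T) \mid X_r, X_t](g(X_r) - E_0[g(X)])(g(X_t) - E_0[g(X)])\big] \\
&= E_0\big[G(X, Y)(g(X) - E_0[g(X)])(g(Y) - E_0[g(Y)])\big] = \nu
\end{align*}

where $G(X, Y)$ is defined in (12), and

\[
\sum_{t=1}^{T} E_0[h(\mathbf{X}_T)V(X_t)] = \sum_{t=1}^{T} E_0\big[E_0[h(\mathbf{X}_T) \mid X_t]V(X_t)\big] = E_0[g(X)V(X)].
\]

For the second term in (84), by using (81), we have

\begin{align*}
\kappa_2(g^{L^*}(X)) &= E_0\big[(g^{L^*}(X) - E_0[g^{L^*}(X)])^2\big] \\
&= E_0\Big[\big((g(X) - E_0[g(X)]) + \beta(W(X) - E_0[W(X)]) + O(\beta^2; X)\big)^2\Big] \\
&= \mathrm{Var}_0(g(X)) + 2\beta E_0\big[(g(X) - E_0[g(X)])(W(X) - E_0[W(X)])\big] + O(\beta^2). \tag{86}
\end{align*}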
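Now notice that $W(x)$ can be written as

\begin{align*}
W(x) &= \sum_{t=1}^{T}\sum_{1 \le r \le T,\, r \ne t} E_0\big[h(\mathbf{X}_T)(g(X_r) - E_0[g(X)]) \mid X_t = x\big] \\
&= \sum_{t=1}^{T}\sum_{1 \le r \le T,\, r \ne t} E_0\big[E_0[h(\mathbf{X}_T) \mid X_r, X_t](g(X_r) - E_0[g(X)]) \mid X_t = x\big] \\
&= E_0\big[G(X, Y)(g(Y) - E_0[g(Y)]) \mid X = x\big]
\end{align*}

where $G(X, Y)$ is defined in (12). Hence

\begin{align*}
E_0\big[(g(X) - E_0[g(X)])(W(X) - E_0[W(X)])\big] &= E_0\big[(g(X) - E_0[g(X)])W(X)\big] \\
&= E_0\big[(g(X) - E_0[g(X)])G(X, Y)(g(Y) - E_0[g(Y)])\big] = \nu.
\end{align*}

Consequently, (86) becomes

\[
\mathrm{Var}_0(g(X)) + 2\beta\nu + O(\beta^2). \tag{87}
\]

Finally, for the third term in (84), we have

\[
\kappa_3(g^{L^*}(X)) = E_0\big[(g(X) - E_0[g(X)])^3\big] + O(\beta) = \kappa_3(g(X)) + O(\beta). \tag{88}
\]

Combining (85), (87) and (88), we have

\begin{align*}
\eta &= \beta^2\,\mathrm{Var}_0(g(X)) + \beta^3\big[\nu + E_0[g(X)V(X)]\big] - \frac{\beta^2}{2}\mathrm{Var}_0(g(X)) - \beta^3\nu - \frac{\beta^3}{6}\kappa_3(g(X)) + O(\beta^4) \\
&= \frac{\beta^2}{2}\mathrm{Var}_0(g(X)) + \beta^3\left[E_0[g(X)V(X)] - \frac{1}{6}\kappa_3(g(X))\right] + O(\beta^4). \tag{89}
\end{align*}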



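Under Assumption 4, we can invert (89) to get

\begin{align*}
\beta &= \sqrt{\frac{2\eta}{\mathrm{Var}_0(g(X))}}\left(1 + \frac{2\beta\big(E_0[g(X)V(X)] - (1/6)\kappa_3(g(X))\big)}{\mathrm{Var}_0(g(X))} + O(\beta^2)\right)^{-1/2} \\
&= \sqrt{\frac{2\eta}{\mathrm{Var}_0(g(X))}}\left(1 - \frac{1}{2}\cdot\frac{2\beta\big(E_0[g(X)V(X)] - (1/6)\kappa_3(g(X))\big)}{\mathrm{Var}_0(g(X))}\right) + O(\eta^{1/2}\beta^2) \\
&= \sqrt{\frac{2\eta}{\mathrm{Var}_0(g(X))}} - \frac{2\eta\big(E_0[g(X)V(X)] - (1/6)\kappa_3(g(X))\big)}{(\mathrm{Var}_0(g(X)))^2} + O(\eta^{3/2}). \tag{90}
\end{align*}

This in particular verifies the condition in Proposition 3, i.e. for any small $\eta$, there exists a large
enough $\alpha^* > 0$ and a corresponding $L^*$ that satisfies (60). Consequently, $L^*$ for some large enough
$\alpha^*$ is an optimal solution to (33).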
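The inversion step from (89) to (90) can be checked on the truncated relation alone. The sketch below drops the $O(\beta^4)$ remainder and treats $\eta = \frac{1}{2}v\beta^2 + c\beta^3$ as exact, where `var` stands for $\mathrm{Var}_0(g(X))$ and `c` for $E_0[g(X)V(X)] - \frac{1}{6}\kappa_3(g(X))$. Both numerical values are hypothetical, and the bisection solver is just one convenient way to obtain a reference value of $\beta$.

```python
import numpy as np

def eta_of_beta(beta, var, c):
    """Truncated relation eta = var * beta^2 / 2 + c * beta^3."""
    return 0.5 * var * beta**2 + c * beta**3

def beta_approx(eta, var, c):
    """Two-term inversion: sqrt(2 eta / var) - 2 eta c / var^2."""
    return np.sqrt(2 * eta / var) - 2 * eta * c / var**2

def beta_exact(eta, var, c, hi=1.0, tol=1e-15):
    """Solve eta_of_beta(beta) = eta by bisection on [0, hi].

    Assumes eta_of_beta is increasing on [0, hi], which holds for the values used below.
    """
    lo = 0.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if eta_of_beta(mid, var, c) < eta:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

var, c = 1.29, 0.4                      # hypothetical values
inversion_errors = {eta: abs(beta_exact(eta, var, c) - beta_approx(eta, var, c))
                    for eta in (1e-4, 1e-6)}
```

For both values of `eta` the gap is of the order $\eta^{3/2}$, whereas keeping only the leading term $\sqrt{2\eta/v}$ leaves a visibly larger gap of order $\eta$.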



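Relation between the objective value and $\beta$, and hence $\eta$: Using (82) again, the optimal
objective value in (33) can be written as

\begin{align*}
E_0[h(\mathbf{X}_T)\underline{L}^*_T]
&= E_0\left[h(\mathbf{X}_T)\prod_{t=1}^{T}\big(1 + \beta(g(X_t) - E_0[g(X)]) + \beta^2 V(X_t) + O(\beta^3; X_t)\big)\right] \\
&= E_0[h(\mathbf{X}_T)] + \beta\sum_{t=1}^{T} E_0\big[h(\mathbf{X}_T)(g(X_t) - E_0[g(X)])\big] \\
&\quad + \beta^2\left[\sum_{t=1}^{T}\sum_{\substack{1 \le r \le T \\ r < t}} E_0\big[h(\mathbf{X}_T)(g(X_r) - E_0[g(X)])(g(X_t) - E_0[g(X)])\big] + \sum_{t=1}^{T} E_0[h(\mathbf{X}_T)V(X_t)]\right] + O(\beta^3) \\
&= E_0[h(\mathbf{X}_T)] + \beta\,\mathrm{Var}_0(g(X)) + \beta^2\left[\frac{\nu}{2} + E_0[g(X)V(X)]\right] + O(\beta^3) \tag{91}
\end{align*}

where the last equality follows from an argument similar to that for (85). Finally, substituting (90) into (91)
gives

\begin{align*}
E_0[h(\mathbf{X}_T)\underline{L}^*_T]
&= E_0[h(\mathbf{X}_T)] + \sqrt{2\,\mathrm{Var}_0(g(X))\,\eta} + \frac{2\eta}{\mathrm{Var}_0(g(X))}\left[-E_0[g(X)V(X)] + \frac{1}{6}\kappa_3(g(X)) + \frac{\nu}{2} + E_0[g(X)V(X)]\right] + O(\eta^{3/2}) \\
&= E_0[h(\mathbf{X}_T)] + \sqrt{2\,\mathrm{Var}_0(g(X))\,\eta} + \frac{\eta}{\mathrm{Var}_0(g(X))}\left[\frac{1}{3}\kappa_3(g(X)) + \nu\right] + O(\eta^{3/2})
\end{align*}

which coincides with Theorem 2.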
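The final expansion can be checked directly in the special case $T = 1$, where there are no pairs $r \ne t$ (so the term $\nu$ is taken to be zero), $g = h$, and the worst-case change of measure is a plain exponential tilting. The sketch below is an illustration under these assumptions only; the discrete distribution (`p`, `h`) and the values of `beta` are hypothetical.

```python
import numpy as np

def worst_case_exact(p, h, beta):
    """Exact pair (eta, objective) generated by the tilt L = exp(beta*h) / E0[exp(beta*h)]."""
    w = np.exp(beta * h)
    L = w / np.sum(p * w)
    eta = np.sum(p * L * np.log(L))     # E0[L log L]
    obj = np.sum(p * h * L)             # E0[h L]
    return eta, obj

def worst_case_expansion(p, h, eta):
    """E0 h + sqrt(2 Var0(h) eta) + eta * kappa_3 / (3 Var0(h)), i.e. the T = 1 case with nu = 0."""
    mean = np.sum(p * h)
    var = np.sum(p * (h - mean) ** 2)
    k3 = np.sum(p * (h - mean) ** 3)
    return mean + np.sqrt(2 * var * eta) + eta * k3 / (3 * var)

# Hypothetical three-point distribution.
p = np.array([0.5, 0.3, 0.2])
h = np.array([0.0, 1.0, 3.0])
checks = []
for beta in (0.05, 0.01):
    eta, obj = worst_case_exact(p, h, beta)
    checks.append((eta, abs(obj - worst_case_expansion(p, h, eta))))
```

Each entry of `checks` pairs a divergence level $\eta$ with the gap between the exact worst-case value and the two-term expansion; the gap stays below $\eta^{3/2}$ in this example.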



A.3 Proof of Proposition 3 under Assumption 6

We shall use a truncation argument, which involves leveraging results from finite horizon problems.
We begin by observing that the operator $\mathcal{K}$ introduced in (53) possesses similar contraction
properties as the operator $\boldsymbol{\mathcal{K}}$ in (57) in the following sense:

Lemma 8. Under the assumption on the cost function $h(\mathbf{X}_T)$, for sufficiently large $\alpha$, the operator
$\mathcal{K} : \mathcal{L}(M) \to \mathcal{L}(M)$ defined in (53) is well-defined, closed, and a strict contraction in $\mathcal{L}(M)$ under
the metric induced by $\|\cdot\|_\Lambda$. Hence there exists a unique fixed point $L^* \in \mathcal{L}(M)$ that satisfies
$\mathcal{K}(L^*) = L^*$. Moreover, $L^*$ is equal to each identical component of the fixed point of $\boldsymbol{\mathcal{K}}$ defined in
(57).

Proof. We shall utilize our result on the operator $\boldsymbol{\mathcal{K}}$ in Lemma 5. It is easy to check that given
$L \in \mathcal{L}(M)$, $\mathcal{K}$ acted on $L$ has the same effect as the component-wise operation $K$, defined in
(55), acted on $(L, \ldots, L) \in \mathcal{L}(M)^{T-1}$. In the proof of Lemma 5 we have already shown that
$K(L_1, \ldots, L_{T-1})$ for any $(L_1, \ldots, L_{T-1}) \in \mathcal{L}(M)^{T-1}$ is well-defined, closed, and a strict contraction
under $\|\cdot\|_\Lambda$, when $\alpha$ is large enough (or $\beta = 1/\alpha$ is small enough in that proof). These properties
are inherited immediately by the operator $\mathcal{K}$.

Next, note that (41) is the fixed point equation associated with $\mathcal{K}$. Moreover, we have already
shown in Appendix A.1 that the same equation governs the fixed point of $\boldsymbol{\mathcal{K}}$, in the sense that the
$T - 1$ components of its fixed point are all identical and satisfy (41). By the uniqueness property
of fixed points, we conclude that the fixed point of $\mathcal{K}$ coincides with each identical component of
the fixed point of $\boldsymbol{\mathcal{K}}$. □
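Now consider a cost function $h(\mathbf{X}_\tau)$ with a random time $\tau$ that satisfies Assumption 6. Again
let $\beta = 1/\alpha > 0$ for convenience. We introduce a sequence of truncated random times $\tau \wedge T$, and
define $\mathcal{K}_T : \mathcal{L} \to \mathcal{L}$ and $\mathcal{K}$ as

\[
\mathcal{K}_T(L)(x) = \frac{e^{\beta g_T^{L}(x)}}{E_0[e^{\beta g_T^{L}(X)}]}
\]

and

\[
\mathcal{K}(L)(x) = \frac{e^{\beta g^{L}(x)}}{E_0[e^{\beta g^{L}(X)}]}
\]

where

\[
g_T^{L}(x) = \sum_{t=1}^{T} E_0\big[h(\mathbf{X}_{\tau \wedge T})\underline{L}^{t}_{\tau \wedge T};\ \tau \wedge T \ge t \mid X_t = x\big]
\]

and

\[
g^{L}(x) = \sum_{t=1}^{\infty} E_0\big[h(\mathbf{X}_\tau)\underline{L}^{t}_{\tau};\ \tau \ge t \mid X_t = x\big].
\]

Here $\underline{L}^{t}_{s} = \prod_{r=1,\ldots,s,\ r \ne t} L(X_r)$. In other words, $\mathcal{K}_T$ is the map identical to $\mathcal{K}$ except that $\tau$ is
replaced by $\tau \wedge T$.

We first need the following proposition:

Proposition 5. Suppose Assumption 6 holds. For $\beta \le \epsilon$ for some small $\epsilon > 0$, both $\mathcal{K}_T$, for
any $T \ge 1$, and $\mathcal{K}$ are well-defined, closed, and strict contractions with the same Lipschitz constant
on the space $\mathcal{L}$ equipped with the metric induced by the $L_1$-norm $\|L - L'\|_1 := E_0|L - L'|$.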



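Proof. We first consider the map $\mathcal{K}$. Recall that by Assumption 6 we have $|h(\mathbf{X}_\tau)| \le C$ for some
constant $C > 0$. Consider

\begin{align*}
e^{\beta\sum_{t=1}^{\infty} E_0[h(\mathbf{X}_\tau)\underline{L}^{t}_{\tau};\,\tau \ge t \mid X_t = x]}
&\le e^{C\beta\sum_{t=1}^{\infty} E_0[\underline{L}^{t}_{\tau};\,\tau \ge t \mid X_t = x]} \\
&= e^{C\beta\sum_{t=1}^{\infty} E_0[\underline{L}^{t}_{\tau};\,\tau \ge t]} && \text{since } \underline{L}^{t}_{\tau} I(\tau \ge t) \text{ is independent of } X_t \\
&= e^{C\beta\sum_{t=1}^{\infty} P_0(\tau \ge t)} && \text{since } \tau \text{ is independent of } \{X_t\}_{t \ge 1} \\
&= e^{C\beta E_0\tau} < \infty && \text{by Assumption 6.} \tag{92}
\end{align*}

Similarly,

\[
e^{\beta\sum_{t=1}^{\infty} E_0[h(\mathbf{X}_\tau)\underline{L}^{t}_{\tau};\,\tau \ge t \mid X_t = x]} \ge e^{-C\beta E_0\tau} > 0. \tag{93}
\]

Therefore $\mathcal{K}$ is well-defined and also closed in $\mathcal{L}$. To prove that $\mathcal{K}$ is a contraction, consider, for any
$L, L' \in \mathcal{L}$,

\begin{align*}
E_0|\mathcal{K}(L) - \mathcal{K}(L')| &= E_0\left|\frac{e^{\beta g^{L}(X)}}{E_0[e^{\beta g^{L}(X)}]} - \frac{e^{\beta g^{L'}(X)}}{E_0[e^{\beta g^{L'}(X)}]}\right| \\
&\le E_0\left[\sup\left|\frac{1}{\xi_2}\right|\,\big|e^{\beta g^{L}(X)} - e^{\beta g^{L'}(X)}\big| + \sup\left|\frac{\xi_1}{\xi_2^2}\right|\,\big|E_0[e^{\beta g^{L}(X)}] - E_0[e^{\beta g^{L'}(X)}]\big|\right] \tag{94}
\end{align*}

by the mean value theorem, where $(\xi_1, \xi_2)$ lies in the line segment between $(e^{\beta g^{L}(X)}, E_0[e^{\beta g^{L}(X)}])$ and
$(e^{\beta g^{L'}(X)}, E_0[e^{\beta g^{L'}(X)}])$. By (92) and (93), we have $\xi_1 \le e^{C\beta E_0\tau}$ and $\xi_2 \ge e^{-C\beta E_0\tau}$ a.s. So (94) is
less than or equal to

\[
2e^{3C\beta E_0\tau}E_0\big|e^{\beta g^{L}(X)} - e^{\beta g^{L'}(X)}\big| \le 2e^{3C\beta E_0\tau}\beta E_0\big[\xi\,|g^{L}(X) - g^{L'}(X)|\big] \tag{95}
\]

by the mean value theorem again, where $\xi$ lies between $e^{\beta g^{L}(X)}$ and $e^{\beta g^{L'}(X)}$ and hence $\xi \le e^{C\beta E_0\tau}$ a.s.
Therefore (95) is further bounded by

\[
2e^{4C\beta E_0\tau}C\beta\sum_{t=1}^{\infty} E_0\big[|\underline{L}^{t}_{\tau} - \underline{L}'^{\,t}_{\tau}|;\ \tau \ge t\big]. \tag{96}
\]

Conditioning on $\tau$, $|\underline{L}^{t}_{\tau} - \underline{L}'^{\,t}_{\tau}| \le \sum_{s=1,\ldots,\tau,\ s \ne t}\underline{\tilde{L}}^{t,s}_{\tau}\,|L(X_s) - L'(X_s)|$, where $\underline{\tilde{L}}^{t,s}_{\tau}$ is a product of either one of
$L(X_r)$ or $L'(X_r)$ for $r = 1, \ldots, \tau$, $r \ne t, s$. Since $\tau$ is independent of $\{X_t\}_{t \ge 1}$ and $\{X_t\}_{t \ge 1}$ are i.i.d.,
we have $E_0[|\underline{L}^{t}_{\tau} - \underline{L}'^{\,t}_{\tau}| \mid \tau] \le (\tau - 1)E_0|L - L'|$. Consequently, (96) is bounded by

\[
2e^{4C\beta E_0\tau}C\beta\sum_{t=1}^{\infty} E_0[\tau - 1;\ \tau \ge t]\,E_0|L - L'| = 2e^{4C\beta E_0\tau}C\beta E_0[\tau(\tau - 1)]\,E_0|L - L'| \le w E_0|L - L'|
\]

for some $w < 1$ when $\beta$ is small enough.

Finally, we note that the above arguments all hold with $\tau$ replaced by $\tau \wedge T$, in the same range
of $\beta$ and with the same Lipschitz constant $w$. This concludes the proposition. □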

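Next we show that $\mathcal{K}_T \to \mathcal{K}$ pointwise on $\mathcal{L}$:

Proposition 6. Suppose Assumption 6 holds. For any $L \in \mathcal{L}$, we have $\mathcal{K}_T(L) \to \mathcal{K}(L)$ in
$\|\cdot\|_1$, uniformly on $\beta \le \epsilon$ for some small $\epsilon > 0$.

Proof. Consider

\begin{align*}
E_0|\mathcal{K}_T(L) - \mathcal{K}(L)| &= E_0\left|\frac{e^{\beta g_T^{L}(X)}}{E_0[e^{\beta g_T^{L}(X)}]} - \frac{e^{\beta g^{L}(X)}}{E_0[e^{\beta g^{L}(X)}]}\right| \\
&\le 2e^{4C\beta E_0\tau}\beta\sum_{t=1}^{\infty} E_0\big|h(\mathbf{X}_{\tau \wedge T})\underline{L}^{t}_{\tau \wedge T} I(\tau \wedge T \ge t) - h(\mathbf{X}_\tau)\underline{L}^{t}_{\tau} I(\tau \ge t)\big| \tag{97}
\end{align*}

by an argument similar to (96). Now consider

\begin{align*}
& \sum_{t=1}^{\infty} E_0\big|h(\mathbf{X}_{\tau \wedge T})\underline{L}^{t}_{\tau \wedge T} I(\tau \wedge T \ge t) - h(\mathbf{X}_\tau)\underline{L}^{t}_{\tau} I(\tau \ge t)\big| \\
&= \sum_{t=1}^{\infty} E_0\big[|h(\mathbf{X}_\tau)\underline{L}^{t}_{\tau} I(\tau \ge t) - h(\mathbf{X}_\tau)\underline{L}^{t}_{\tau} I(\tau \ge t)|;\ \tau \le T\big] \\
&\quad + \sum_{t=1}^{\infty} E_0\big[|h(\mathbf{X}_T)\underline{L}^{t}_{T} I(T \ge t) - h(\mathbf{X}_\tau)\underline{L}^{t}_{\tau} I(\tau \ge t)|;\ \tau > T\big] \\
&= \sum_{t=1}^{\infty} E_0\big[|h(\mathbf{X}_T)\underline{L}^{t}_{T} I(T \ge t) - h(\mathbf{X}_\tau)\underline{L}^{t}_{\tau} I(\tau \ge t)|;\ \tau > T\big] \\
&= \sum_{t=1}^{T} E_0\big[|h(\mathbf{X}_T)\underline{L}^{t}_{T} - h(\mathbf{X}_\tau)\underline{L}^{t}_{\tau}|;\ \tau > T\big] + \sum_{t=T+1}^{\infty} E_0\big[|h(\mathbf{X}_\tau)|\underline{L}^{t}_{\tau};\ \tau \ge t\big] \\
&\le \sum_{t=1}^{T} E_0\big[|h(\mathbf{X}_T)||\underline{L}^{t}_{T} - \underline{L}^{t}_{\tau}| + |h(\mathbf{X}_T) - h(\mathbf{X}_\tau)|\underline{L}^{t}_{\tau};\ \tau > T\big] + C\sum_{t=T+1}^{\infty} P_0(\tau \ge t) \\
&\le 3CT P_0(\tau > T) + C\sum_{t=T+1}^{\infty} P_0(\tau \ge t) \\
&\to 0
\end{align*}

as $T \to \infty$, since $E_0\tau < \infty$. Hence (97) converges to 0 uniformly over $\beta \le \epsilon$ for some small $\epsilon > 0$.
This concludes the proposition. □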
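To get a feel for how fast the truncation bound $3CT P_0(\tau > T) + C\sum_{t=T+1}^{\infty} P_0(\tau \ge t)$ vanishes, the sketch below evaluates it for a geometric random time on $\{1, 2, \ldots\}$. The geometric law, its parameter `p` and the constant `C` are assumptions made for illustration only; the argument above requires only a finite mean for $\tau$.

```python
def truncation_bound(p, T, C=1.0, n_terms=10_000):
    """3*C*T*P(tau > T) + C * sum_{t > T} P(tau >= t) for tau ~ Geometric(p) on {1, 2, ...}.

    The infinite sum is cut off after n_terms terms; the neglected tail is
    geometrically small.
    """
    q = 1.0 - p
    tail = q ** T                                           # P(tau > T)
    rest = sum(q ** (t - 1) for t in range(T + 1, T + 1 + n_terms))
    return 3 * C * T * tail + C * rest

def truncation_bound_closed_form(p, T, C=1.0):
    """Same quantity, using sum_{t > T} q^(t-1) = q^T / p."""
    q = 1.0 - p
    return C * q ** T * (3 * T + 1.0 / p)

bounds = [truncation_bound_closed_form(0.3, T) for T in (5, 20, 50, 200)]
```

For a geometric $\tau$ the bound decays like $T(1-p)^T$, so moderate truncation levels already make it negligible in this example.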

By a simple argument on the continuity of fixed points (see, for example, Theorem 1.2 in [10]),
Proposition 6 implies the following convergence result:

Corollary 2. Suppose Assumption 6 holds. Let $L^{(T)}$ and $L^*$ be the fixed points of $\mathcal{K}_T$ and $\mathcal{K}$.
For small enough $\beta$, we have $L^{(T)} \to L^*$ in the $\|\cdot\|_1$-norm.

Finally, we show that $L^*$ is the optimal solution to the Lagrangian relaxation of (63):

Proposition 7. Under Assumption 6, the fixed point $L^*$ of the operator $\mathcal{K}$ maximizes

\[
E_0[h(\mathbf{X}_\tau)\underline{L}_\tau] - \alpha E_0[L\log L]
\]

when $\alpha$ is large enough.

Proposition 7 will initiate the same asymptotic expansion as in Section 5.4, which will give rise
to the first and second order nonparametric derivatives in Theorem 3.

Proof of Proposition 7. We use the fact that for any fixed $T$, $L^{(T)}$ is the optimal solution to
$E_0[h(\mathbf{X}_{\tau \wedge T})\underline{L}_{\tau \wedge T}] - \alpha E_0[L\log L]$, as a direct consequence of Proposition 2. Hence we have the
inequality

\[
E_0[h(\mathbf{X}_{\tau \wedge T})\underline{L}_{\tau \wedge T}] - \alpha E_0[L\log L] \le E_0[h(\mathbf{X}_{\tau \wedge T})\underline{L}^{(T)}_{\tau \wedge T}] - \alpha E_0[L^{(T)}\log L^{(T)}] \tag{98}
\]

for any $L \in \mathcal{L}$ (since $h$ is bounded we can merely replace $\mathcal{L}(M)$ by $\mathcal{L}$, i.e. putting $M = \infty$, in
Proposition 2). Here (98) holds for any $T \ge 1$ for $\alpha$ uniformly large (the uniformity can be verified
using Proposition 5 and repeating the argument in the proof of Lemma 7, noting that $\tau \wedge T \le \tau$
a.s.). Our main argument consists of letting $T \to \infty$ on both sides of (98).

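We first show that, for any $L \in \mathcal{L}$, the first term on the left hand side of (98) converges to
$E_0[h(\mathbf{X}_\tau)\underline{L}_\tau]$. Consider

\begin{align*}
& \big|E_0[h(\mathbf{X}_{\tau \wedge T})\underline{L}_{\tau \wedge T}] - E_0[h(\mathbf{X}_\tau)\underline{L}_\tau]\big| \\
&= \left|\sum_{t=1}^{\infty} E_0[h(\mathbf{X}_t)\underline{L}_t]P_0(\tau \wedge T = t) - \sum_{t=1}^{\infty} E_0[h(\mathbf{X}_t)\underline{L}_t]P_0(\tau = t)\right| \\
&\le C\sum_{t=1}^{T-1}|P_0(\tau \wedge T = t) - P_0(\tau = t)| + C|P_0(\tau \ge T) - P_0(\tau = T)| + C\sum_{t=T+1}^{\infty}|P_0(\tau \wedge T = t) - P_0(\tau = t)| \\
&= C P_0(\tau > T) + C\sum_{t=T+1}^{\infty} P_0(\tau = t) \\
&\to 0
\end{align*}

as $T \to \infty$. Hence the left hand side of (98) converges to $E_0[h(\mathbf{X}_\tau)\underline{L}_\tau] - \alpha E_0[L\log L]$ for any $L \in \mathcal{L}$.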
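Now consider the right hand side. For the first term, consider

\begin{align*}
& \big|E_0[h(\mathbf{X}_{\tau \wedge T})\underline{L}^{(T)}_{\tau \wedge T}] - E_0[h(\mathbf{X}_\tau)\underline{L}^*_\tau]\big| \\
&= \left|\sum_{t=1}^{\infty} E_0[h(\mathbf{X}_t)\underline{L}^{(T)}_t]P_0(\tau \wedge T = t) - \sum_{t=1}^{\infty} E_0[h(\mathbf{X}_t)\underline{L}^*_t]P_0(\tau = t)\right| \\
&\le C\sum_{t=1}^{T-1} E_0|\underline{L}^{(T)}_t - \underline{L}^*_t|\,P_0(\tau = t) + 2C\big(P_0(\tau > T) + P_0(\tau = T)\big) + C\sum_{t=T+1}^{\infty} P_0(\tau = t) \\
&\le C\sum_{t=1}^{T} t\,P_0(\tau = t)\,E_0|L^{(T)} - L^*| + 2C\big(P_0(\tau > T) + P_0(\tau = T)\big) + C P_0(\tau \ge T + 1) \quad \text{by the argument following (96)} \\
&= C E_0[\tau;\ \tau \le T]\,E_0|L^{(T)}(X) - L^*(X)| + 2C\big(P_0(\tau > T) + P_0(\tau = T)\big) + C P_0(\tau \ge T + 1) \\
&\to 0. \tag{99}
\end{align*}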
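Moreover, for the second term in (98), write

\[
E_0[L^{(T)}\log L^{(T)}] = \beta E_0[h(\mathbf{X}_{\tau \wedge T})\underline{L}^{(T)}_{\tau \wedge T}] - \log E_0\big[e^{\beta g_T^{L^{(T)}}(X)}\big]
\]

and

\[
E_0[L^*\log L^*] = \beta E_0[h(\mathbf{X}_\tau)\underline{L}^*_\tau] - \log E_0\big[e^{\beta g^{L^*}(X)}\big]
\]

by the definition of the fixed points for $\mathcal{K}_T$ and $\mathcal{K}$. To prove that $E_0[L^{(T)}\log L^{(T)}] \to E_0[L^*\log L^*]$,
we have to show that $E_0[h(\mathbf{X}_{\tau \wedge T})\underline{L}^{(T)}_{\tau \wedge T}] \to E_0[h(\mathbf{X}_\tau)\underline{L}^*_\tau]$, which is achieved by (99), and that
$\log E_0[e^{\beta g_T^{L^{(T)}}(X)}] \to \log E_0[e^{\beta g^{L^*}(X)}]$, which we will show as follows. Consider

\begin{align*}
& \big|\log E_0[e^{\beta g_T^{L^{(T)}}(X)}] - \log E_0[e^{\beta g^{L^*}(X)}]\big| \\
&\le e^{C\beta E_0\tau}\,\big|E_0[e^{\beta g_T^{L^{(T)}}(X)}] - E_0[e^{\beta g^{L^*}(X)}]\big| \quad \text{by the mean value theorem and the bound in (93)} \\
&\le e^{2C\beta E_0\tau}\beta\sum_{t=1}^{\infty} E_0\big|h(\mathbf{X}_{\tau \wedge T})\underline{L}^{(T),t}_{\tau \wedge T} I(\tau \wedge T \ge t) - h(\mathbf{X}_\tau)\underline{L}^{*,t}_{\tau} I(\tau \ge t)\big| \tag{100}
\end{align*}

by arguments similar to (96) and (97).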
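Now

\begin{align*}
& \sum_{t=1}^{\infty} E_0\big|h(\mathbf{X}_{\tau \wedge T})\underline{L}^{(T),t}_{\tau \wedge T} I(\tau \wedge T \ge t) - h(\mathbf{X}_\tau)\underline{L}^{*,t}_{\tau} I(\tau \ge t)\big| \\
&= \sum_{t=1}^{\infty} E_0\big[|h(\mathbf{X}_\tau)\underline{L}^{(T),t}_{\tau} I(\tau \ge t) - h(\mathbf{X}_\tau)\underline{L}^{*,t}_{\tau} I(\tau \ge t)|;\ \tau \le T\big] \\
&\quad + \sum_{t=1}^{\infty} E_0\big[|h(\mathbf{X}_T)\underline{L}^{(T),t}_{T} I(T \ge t) - h(\mathbf{X}_\tau)\underline{L}^{*,t}_{\tau} I(\tau \ge t)|;\ \tau > T\big]. \tag{101}
\end{align*}

Note that the first term in (101) is bounded by

\begin{align*}
C\sum_{t=1}^{\infty} E_0\big[|\underline{L}^{(T),t}_{\tau} - \underline{L}^{*,t}_{\tau}|;\ \tau \ge t,\ \tau \le T\big]
&\le C\sum_{t=1}^{\infty} E_0[\tau - 1;\ t \le \tau \le T]\,E_0|L^{(T)} - L^*| \quad \text{by the argument following (96)} \\
&= C E_0[\tau(\tau - 1);\ \tau \le T]\,E_0|L^{(T)} - L^*| \\
&\to 0
\end{align*}

since $E_0\tau < \infty$. The second term in (101) can be written as

\begin{align*}
& \sum_{t=1}^{T} E_0\big[|h(\mathbf{X}_T)\underline{L}^{(T),t}_{T} - h(\mathbf{X}_\tau)\underline{L}^{*,t}_{\tau}|;\ \tau > T\big] + \sum_{t=T+1}^{\infty} E_0\big[|h(\mathbf{X}_\tau)|\underline{L}^{*,t}_{\tau};\ \tau \ge t\big] \\
&\le \sum_{t=1}^{T} E_0\big[|h(\mathbf{X}_T)||\underline{L}^{(T),t}_{T} - \underline{L}^{*,t}_{\tau}| + |h(\mathbf{X}_T) - h(\mathbf{X}_\tau)|\underline{L}^{*,t}_{\tau};\ \tau > T\big] + C\sum_{t=T+1}^{\infty} P_0(\tau \ge t) \\
&\le 3CT P_0(\tau > T) + C\sum_{t=T+1}^{\infty} P_0(\tau \ge t) \\
&\to 0
\end{align*}

since $E_0\tau < \infty$. We therefore prove that (100) converges to 0 and the right hand side of (98)
converges to $E_0[h(\mathbf{X}_\tau)\underline{L}^*_\tau] - \alpha E_0[L^*\log L^*]$. This concludes the proposition. □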